Preface


This edition provides a rather substantial addition to the material covered in 
the first edition. The principal difference is the inclusion of three new 
chapters, Chapters 10, 11, and 12, in addition to an appendix of solutions to 
exercises. 

Chapter 10 covers orthogonal polynomials, such as Legendre, Chebyshev, 
Jacobi, Laguerre, and Hermite polynomials, and discusses their applications 
in statistics. Chapter 11 provides a thorough coverage of Fourier series. The 
presentation is done in such a way that a reader with no prior knowledge of 
Fourier series can have a clear understanding of the theory underlying the 
subject. Several applications of Fourier series in statistics are presented.
Chapter 12 deals with approximation of Riemann integrals. It gives an 
exposition of methods for approximating integrals, including those that are 
multidimensional. Applications of some of these methods in statistics 
are discussed. This subject area has recently gained prominence in several 
fields of science and engineering, and, in particular, Bayesian statistics. The 
material should be helpful to readers who may be interested in pursuing 
further studies in this area. 

A significant addition is the inclusion of a major appendix that gives 
detailed solutions to the vast majority of the exercises in Chapters 1-12. This 
supplement was prepared in response to numerous suggestions by users of 
the first edition. The solutions should also be helpful in getting a better 
understanding of the various topics covered in the book. 

In addition to the aforementioned material, several new exercises were 
added to some of the chapters in the first edition. Chapter 1 was expanded by 
the inclusion of some basic topological concepts. Chapter 9 was modified to 
accommodate Chapter 10. The changes in the remaining chapters, 2 through 
8, are very minor. The general bibliography was updated. 

The choice of the new chapters was motivated by the evolution of the field 
of statistics and the growing needs of statisticians for mathematical tools 
beyond the realm of advanced calculus. This is certainly true in topics 
concerning approximation of integrals and distribution functions, stochastic 
processes, time series analysis, and the modeling of periodic response
functions, to name just a few.

The book is self-contained. It can be used as a text for a two-semester 
course in advanced calculus and introductory mathematical analysis.
Chapters 1-7 may be covered in one semester, and Chapters 8-12 in the other
semester. With its coverage of a wide variety of topics, the book can also 
serve as a reference for statisticians, and others, who need an adequate 
knowledge of mathematics, but do not have the time to wade through the 
myriad mathematics books. It is hoped that the inclusion of a separate 
section on applications in statistics in every chapter will provide a good 
motivation for learning the material in the book. This represents a
continuation of the practice followed in the first edition.

As with the first edition, the book is intended as much for mathematicians 
as for statisticians. It can easily be turned into a pure mathematics book by 
simply omitting the section on applications in statistics in a given chapter. 
Mathematicians, however, may find the sections on applications in statistics
to be quite useful, particularly mathematics students seeking an
interdisciplinary major. Such a major is becoming increasingly popular in many circles.
In addition, several topics are included here that are not usually found in a 
typical advanced calculus book, such as approximation of functions and 
integrals, Fourier series, and orthogonal polynomials. The fields of
mathematics and statistics are becoming increasingly intertwined, making any
separation of the two unpropitious. The book represents a manifestation of 
the interdependence of the two fields. 

The mathematics background needed for this edition is the same as for 
the first edition. For readers interested in statistical applications, a
background in introductory mathematical statistics will be helpful, but not
absolutely essential. The annotated bibliography in each chapter can be consulted
for additional readings. 

I am grateful to all those who provided comments and helpful suggestions 
concerning the first edition, and to my wife Ronnie for her help and support. 


ANDRE I. KHURI 


Gainesville, Florida 


Preface to the First Edition 


The most remarkable mathematical achievement of the seventeenth century 
was the invention of calculus by Isaac Newton (1642-1727) and Gottfried 
Wilhelm Leibniz (1646-1716). It has since played a significant role in all 
fields of science, serving as its principal quantitative language. There is hardly 
any scientific discipline that does not require a good knowledge of calculus. 
The field of statistics is no exception. 

Advanced calculus has had a fundamental and seminal role in the
development of the basic theory underlying statistical methodology. With the rapid
growth of statistics as a discipline, particularly in the last three decades, 
knowledge of advanced calculus has become imperative for understanding 
the recent advances in this field. Students as well as research workers in 
statistics are expected to have a certain level of mathematical sophistication 
in order to cope with the intricacies necessitated by the emergence of new
statistical methodologies. 

This book has two purposes. The first is to provide beginning graduate 
students in statistics with the basic concepts of advanced calculus. A high 
percentage of these students have undergraduate training in disciplines other 
than mathematics with only two or three introductory calculus courses. They 
are, in general, not adequately prepared to pursue an advanced graduate 
degree in statistics. This book is designed to fill the gaps in their
mathematical training and equip them with the advanced calculus tools needed in their
graduate work. It can also provide the basic prerequisites for more advanced 
courses in mathematics. 

One salient feature of this book is the inclusion of a complete section in 
each chapter describing applications in statistics of the material given in the 
chapter. Furthermore, a large segment of Chapter 8 is devoted to the 
important problem of optimization in statistics. The purpose of these
applications is to help motivate the learning of advanced calculus by showing its
relevance in the field of statistics. There are many advanced calculus books 
designed for engineers or business majors, but there are none for statistics 
majors. This is the first advanced calculus book to emphasize applications in
statistics. 

The scope of this book is not limited to serving the needs of statistics 
graduate students. Practicing statisticians can use it to sharpen their
mathematical skills, or they may want to keep it as a handy reference for their
research work. These individuals may be interested in the last three chapters,
particularly Chapters 8 and 9, which include a large number of citations of
statistical papers.

The second purpose of the book concerns mathematics majors. The book’s 
thorough and rigorous coverage of advanced calculus makes it quite suitable 
as a text for juniors or seniors. Chapters 1 through 7 can be used for this 
purpose. The instructor may choose to omit the last section in each chapter, 
which pertains to statistical applications. Students may benefit, however, 
from the exposure to these additional applications. This is particularly true 
given that the trend today is to allow the undergraduate student to have a 
major in mathematics with a minor in some other discipline. In this respect, 
the book can be particularly useful to those mathematics students who may 
be interested in a minor in statistics. 

Other features of this book include a detailed coverage of optimization 
techniques and their applications in statistics (Chapter 8), and an
introduction to approximation theory (Chapter 9). In addition, an annotated
bibliography is given at the end of each chapter. This bibliography can help direct
the interested reader to other sources in mathematics and statistics that are 
relevant to the material in a given chapter. A general bibliography is 
provided at the end of the book. There are also many examples and exercises 
in mathematics and statistics in every chapter. The exercises are classified by 
discipline (mathematics and statistics) for the benefit of the student and the 
instructor. 

The reader is assumed to have a mathematical background that is usually 
obtained in the freshman-sophomore calculus sequence. A prerequisite for 
understanding the statistical applications in the book is an introductory 
statistics course. Obviously, those not interested in such applications need
not worry about this prerequisite. Readers who do not have any background 
in statistics, but are nevertheless interested in the application sections, can 
make use of the annotated bibliography in each chapter for additional 
reading. 

The book contains nine chapters. Chapters 1-7 cover the main topics in 
advanced calculus, while Chapters 8 and 9 include more specialized subject
areas. More specifically, Chapter 1 introduces the basic elements of set 
theory. Chapter 2 presents some fundamental concepts concerning vector 
spaces and matrix algebra. The purpose of this chapter is to facilitate the 
understanding of the material in the remaining chapters, particularly in
Chapters 7 and 8. Chapter 3 discusses the concepts of limits and continuity of 
functions. The notion of differentiation is studied in Chapter 4. Chapter 5 
covers the theory of infinite sequences and series. Integration of functions is 
the theme of Chapter 6. Multidimensional calculus is introduced in Chapter
7. This chapter provides an extension of the concepts of limits, continuity,
differentiation, and integration to functions of several variables
(multivariable functions). Chapter 8 consists of two parts. The first part presents an
overview of the various methods of optimization of multivariable functions 
whose optima cannot be obtained explicitly by standard advanced calculus 
techniques. The second part discusses a variety of topics of interest to 
statisticians. The common theme among these topics is optimization. Finally, 
Chapter 9 deals with the problem of approximation of continuous functions 
with polynomial and spline functions. This chapter is of interest to both 
mathematicians and statisticians and contains a wide variety of applications 
in statistics. 

I am grateful to the University of Florida for granting me a sabbatical 
leave that made it possible for me to embark on the project of writing this 
book. I would also like to thank Professor Rocco Ballerini at the University 
of Florida for providing me with some of the exercises used in Chapters 3, 4,
5, and 6. 


ANDRE I. KHURI 


Gainesville, Florida 


CHAPTER 1 


An Introduction to Set Theory 


The origin of the modern theory of sets can be traced back to the Russian-born 
German mathematician Georg Cantor (1845-1918). This chapter introduces 
the basic elements of this theory. 


1.1. THE CONCEPT OF A SET 


A set is any collection of well-defined and distinguishable objects. These 
objects are called the elements, or members, of the set and are denoted by 
lowercase letters. Thus a set can be perceived as a collection of elements 
united into a single entity. Georg Cantor stressed this in the following words: 
“A set is a multitude conceived of by us as a one.” 

If x is an element of a set A, then this fact is denoted by writing $x \in A$.
If, however, x is not an element of A, then we write $x \notin A$. Curly brackets
are usually used to describe the contents of a set. For example, if a set A
consists of the elements $x_1, x_2, \ldots, x_n$, then it can be represented as
$A = \{x_1, x_2, \ldots, x_n\}$. In the event membership in a set is determined by the
satisfaction of a certain property or a relationship, then the description of the
same can be given within the curly brackets. For example, if A consists of all
real numbers x such that $x^2 > 1$, then it can be expressed as $A = \{x \mid x^2 > 1\}$,
where the bar | is used simply to mean “such that.” The definition of sets in
this manner is based on the axiom of abstraction, which states that given any
property, there exists a set whose elements are just those entities having that
property.


Definition 1.1.1. The set that contains no elements is called the empty set 
and is denoted by Ø. □


Definition 1.1.2. A set A is a subset of another set B, written symbolically
as $A \subset B$, if every element of A is an element of B. If B contains at
least one element that is not in A, then A is said to be a proper subset of B. □


Definition 1.1.3. A set A and a set B are equal if $A \subset B$ and $B \subset A$.
Thus, every element of A is an element of B and vice versa. □


Definition 1.1.4. The set that contains all sets under consideration in a 
certain study is called the universal set and is denoted by $\Omega$. □


1.2. SET OPERATIONS 


There are two basic operations for sets that produce new sets from existing 
ones. They are the operations of union and intersection. 


Definition 1.2.1. The union of two sets A and B, denoted by $A \cup B$, is
the set of elements that belong to either A or B, that is,


$A \cup B = \{x \mid x \in A \text{ or } x \in B\}$. □


This definition can be extended to more than two sets. For example, if
$A_1, A_2, \ldots, A_n$ are n given sets, then their union, denoted by
$\bigcup_{i=1}^{n} A_i$, is a set such that x is an element of it if and only if
x belongs to at least one of the $A_i$ (i = 1, 2, ..., n).


Definition 1.2.2. The intersection of two sets A and B, denoted by
$A \cap B$, is the set of elements that belong to both A and B. Thus


$A \cap B = \{x \mid x \in A \text{ and } x \in B\}$. □


This definition can also be extended to more than two sets. As before, if
$A_1, A_2, \ldots, A_n$ are n given sets, then their intersection, denoted by
$\bigcap_{i=1}^{n} A_i$, is the set consisting of all elements that belong to all
the $A_i$ (i = 1, 2, ..., n).


Definition 1.2.3. Two sets A and B are disjoint if their intersection is the 
empty set, that is, $A \cap B = \emptyset$. □


Definition 1.2.4. The complement of a set A, denoted by $\bar{A}$, is the set
consisting of all elements in the universal set that do not belong to A. In
other words, $x \in \bar{A}$ if and only if $x \notin A$.

The complement of A with respect to a set B is the set $B - A$, which
consists of the elements of B that do not belong to A. This complement is
called the relative complement of A with respect to B. □


From Definitions 1.1.1-1.1.4 and 1.2.1-1.2.4, the following results can be 
concluded: 


RESULT 1.2.1. The empty set Ø is a subset of every set. To show this,
suppose that A is any set. If it is false that $\emptyset \subset A$, then there must be an
element in Ø which is not in A. But this is not possible, since Ø is empty. It
is therefore true that $\emptyset \subset A$.


RESULT 1.2.2. The empty set Ø is unique. To prove this, suppose that $\emptyset_1$
and $\emptyset_2$ are two empty sets. Then, by the previous result, $\emptyset_1 \subset \emptyset_2$ and
$\emptyset_2 \subset \emptyset_1$. Hence, $\emptyset_1 = \emptyset_2$.


RESULT 1.2.3. The complement of Ø is $\Omega$. Vice versa, the complement
of $\Omega$ is Ø.


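RESULT 1.2.4. The complement of $\bar{A}$ is A.

RESULT 1.2.5. For any set A, $A \cup \bar{A} = \Omega$ and $A \cap \bar{A} = \emptyset$.

RESULT 1.2.6. $A - B = A - A \cap B$.

RESULT 1.2.7. $A \cup (B \cup C) = (A \cup B) \cup C$.

RESULT 1.2.8. $A \cap (B \cap C) = (A \cap B) \cap C$.

RESULT 1.2.9. $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$.

RESULT 1.2.10. $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$.

RESULT 1.2.11. $\overline{A \cup B} = \bar{A} \cap \bar{B}$. More generally, $\overline{\bigcup_{i=1}^{n} A_i} = \bigcap_{i=1}^{n} \bar{A}_i$.

RESULT 1.2.12. $\overline{A \cap B} = \bar{A} \cup \bar{B}$. More generally, $\overline{\bigcap_{i=1}^{n} A_i} = \bigcup_{i=1}^{n} \bar{A}_i$.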
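Several of the preceding results can be checked numerically on small finite sets. The following Python sketch is only an illustration, not a proof: the universal set `OMEGA` and the sets `A`, `B`, `C` are hypothetical choices, and the helper `complement` is introduced here for convenience.

```python
# Hypothetical finite universal set and subsets, chosen only for illustration.
OMEGA = set(range(10))
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
C = {3, 5, 7, 9}


def complement(s, universe=OMEGA):
    """Complement of s with respect to the universal set."""
    return universe - s


# Result 1.2.5: A and its complement exhaust OMEGA and do not overlap.
assert A | complement(A) == OMEGA
assert A & complement(A) == set()

# Result 1.2.6: A - B = A - (A intersect B).
assert A - B == A - (A & B)

# Results 1.2.9 and 1.2.10: the two distributive laws.
assert A | (B & C) == (A | B) & (A | C)
assert A & (B | C) == (A & B) | (A & C)

# Results 1.2.11 and 1.2.12: De Morgan's laws.
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)
```

Passing assertions confirm the identities only for these particular sets; the general statements follow from the definitions.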

Definition 1.2.5. Let A and B be two sets. Their Cartesian product,
denoted by $A \times B$, is the set of all ordered pairs (a, b) such that $a \in A$ and
$b \in B$, that is,


$A \times B = \{(a, b) \mid a \in A \text{ and } b \in B\}$.


The word “ordered” means that if a and c are elements in A and b and d
are elements in B, then (a, b) = (c, d) if and only if a = c and b = d. □


The preceding definition can be extended to more than two sets. For
example, if $A_1, A_2, \ldots, A_n$ are n given sets, then their Cartesian product is
denoted by $\times_{i=1}^{n} A_i$ and defined by


$\times_{i=1}^{n} A_i = \{(a_1, a_2, \ldots, a_n) \mid a_i \in A_i,\ i = 1, 2, \ldots, n\}$.


Here, $(a_1, a_2, \ldots, a_n)$, called an ordered n-tuple, represents a generalization
of the ordered pair. In particular, if the $A_i$ are equal to A for
i = 1, 2, ..., n, then one writes $A^n$ for $\times_{i=1}^{n} A$.


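The following results can be easily verified:


RESULT 1.2.13. $A \times B = \emptyset$ if and only if A = Ø or B = Ø.


RESULT 1.2.14. $(A \cup B) \times C = (A \times C) \cup (B \times C)$.


RESULT 1.2.15. $(A \cap B) \times C = (A \times C) \cap (B \times C)$.


RESULT 1.2.16. $(A \times B) \cap (C \times D) = (A \cap C) \times (B \cap D)$.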
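As an illustration, Results 1.2.13 through 1.2.16 can be checked on small finite sets with the standard-library function `itertools.product`. The helper name `cartesian` and the four sets below are assumptions made for this sketch only.

```python
from itertools import product


def cartesian(*sets):
    """Cartesian product of finitely many finite sets, as a set of tuples."""
    return set(product(*sets))


# Hypothetical sets, chosen so that the intersections below are nonempty.
A = {1, 2, 3}
B = {2, 3, 4}
C = {3, 4, 5}
D = {4, 5, 6}

# Result 1.2.13 (one direction): a product with the empty set is empty.
assert cartesian(A, set()) == set()

# Results 1.2.14 and 1.2.15.
assert cartesian(A | B, C) == cartesian(A, C) | cartesian(B, C)
assert cartesian(A & B, C) == cartesian(A, C) & cartesian(B, C)

# Result 1.2.16.
assert cartesian(A, B) & cartesian(C, D) == cartesian(A & C, B & D)

# A^n for n = 3 contains |A|^3 ordered triples.
assert len(cartesian(A, A, A)) == len(A) ** 3
```

Tuples are used for the ordered pairs because two tuples are equal exactly when their corresponding components are equal, which matches the meaning of “ordered” in Definition 1.2.5.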


1.3. RELATIONS AND FUNCTIONS


Let $A \times B$ be the Cartesian product of two sets, A and B.


Definition 1.3.1. A relation $\rho$ from A to B is a subset of $A \times B$, that is,
$\rho$ consists of ordered pairs (a, b) such that $a \in A$ and $b \in B$. In particular, if
A = B, then $\rho$ is said to be a relation in A.

For example, if A = {7, 8, 9} and B = {7, 8, 9, 10}, then
$\rho = \{(a, b) \mid a < b,\ a \in A,\ b \in B\}$ is a relation from A to B that consists of
the six ordered pairs (7, 8), (7, 9), (7, 10), (8, 9), (8, 10), and (9, 10).

Whenever $\rho$ is a relation and $(x, y) \in \rho$, then x and y are said to be
$\rho$-related. This is denoted by writing $x \rho y$. □
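The example in Definition 1.3.1 can be reproduced directly, since a relation between finite sets is simply a set of ordered pairs. The variable names in this short Python sketch are chosen here for illustration.

```python
A = {7, 8, 9}
B = {7, 8, 9, 10}

# The relation a < b, represented as a subset of the Cartesian product A x B.
rho = {(a, b) for a in A for b in B if a < b}

# rho is a subset of A x B and contains exactly six ordered pairs.
assert rho <= {(a, b) for a in A for b in B}
print(sorted(rho))
```

Here `(7, 8) in rho` plays the role of the statement that 7 and 8 are related.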


Definition 1.3.2. A relation $\rho$ in a set A is an equivalence relation if the
following properties are satisfied:


1. $\rho$ is reflexive, that is, $a \rho a$ for any a in A.
2. $\rho$ is symmetric, that is, if $a \rho b$, then $b \rho a$ for any a, b in A.
3. $\rho$ is transitive, that is, if $a \rho b$ and $b \rho c$, then $a \rho c$ for any a, b, c in A.


If $\rho$ is an equivalence relation in a set A, then for a given $a_0$ in A, the set

$C(a_0) = \{a \in A \mid a_0 \rho a\}$,


which consists of all elements of A that are $\rho$-related to $a_0$, is called an
equivalence class of $a_0$. □


RESULT 1.3.1. $a \in C(a)$ for any a in A. Thus each element of A is an
element of an equivalence class.


RESULT 1.3.2. If $C(a_1)$ and $C(a_2)$ are two equivalence classes, then
either $C(a_1) = C(a_2)$, or $C(a_1)$ and $C(a_2)$ are disjoint subsets.


It follows from Results 1.3.1 and 1.3.2 that if A is a nonempty set, the
collection of distinct $\rho$-equivalence classes of A forms a partition of A.

As an example of an equivalence relation, consider that $a \rho b$ if and only if
a and b are integers such that a - b is divisible by a nonzero integer n. This
is the relation of congruence modulo n in the set of integers and is written
symbolically as $a \equiv b \pmod{n}$. Clearly, $a \equiv a \pmod{n}$, since a - a = 0 is
divisible by n. Also, if $a \equiv b \pmod{n}$, then $b \equiv a \pmod{n}$, since if a - b is
divisible by n, then so is b - a. Furthermore, if $a \equiv b \pmod{n}$ and
$b \equiv c \pmod{n}$, then $a \equiv c \pmod{n}$. This is true because if a - b and b - c are both
divisible by n, then so is (a - b) + (b - c) = a - c. Now, if $a_0$ is a given
integer, then a $\rho$-equivalence class of $a_0$ consists of all integers that can be
written as $a = a_0 + kn$, where k is an integer. Thus in this example $C(a_0)$ is
the set $\{a_0 + kn \mid k \in J\}$, where J denotes the set of all integers.
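The partition into equivalence classes can be sketched in Python for a finite window of integers. The function `equivalence_classes`, the modulus n = 3, and the window from -6 to 6 are all hypothetical choices made for this illustration; the true classes of congruence modulo n are infinite sets.

```python
def equivalence_classes(elements, related):
    """Group elements into classes of an equivalence relation `related`."""
    classes = []
    for a in elements:
        for cls in classes:
            # By symmetry and transitivity, comparing with a single
            # representative of the class is sufficient.
            if related(next(iter(cls)), a):
                cls.add(a)
                break
        else:
            # a is related to no existing class, so it starts a new one.
            classes.append({a})
    return classes


n = 3
window = range(-6, 7)
classes = equivalence_classes(window, lambda a, b: (a - b) % n == 0)

for cls in classes:
    print(sorted(cls))
```

Each printed list is the part of a set of the form $\{a_0 + kn\}$ that falls inside the window, and the lists together cover the window without overlap, in agreement with Results 1.3.1 and 1.3.2.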


Definition 1.3.3. Let $\rho$ be a relation from A to B. Suppose that $\rho$ has
the property that for all x in A, if $x \rho y$ and $x \rho z$, where y and z are elements
in B, then y = z. Such a relation is called a function. □


Thus a function is a relation $\rho$ such that any two elements in B that are
$\rho$-related to the same x in A must be identical. In other words, to each
element x in A, there corresponds only one element y in B. We call y the
value of the function at x and denote it by writing y = f(x). The set A is
called the domain of the function f, and the set of all values of f(x) for x in
A is called the range of f, or the image of A under f, and is denoted by
f(A). In this case, we say that f is a function, or a mapping, from A into B.
We express this fact by writing $f: A \to B$. Note that f(A) is a subset of B. In
particular, if B = f(A), then f is said to be a function from A onto B. In this
case, every element b in B has a corresponding element a in A such that
b = f(a).


Definition 1.3.4. A function f defined on a set A is said to be a
one-to-one function if whenever $f(x_1) = f(x_2)$ for $x_1, x_2$ in A, one has
$x_1 = x_2$. Equivalently, f is a one-to-one function if whenever $x_1 \neq x_2$, one has
$f(x_1) \neq f(x_2)$. □


Thus a function $f: A \to B$ is one-to-one if to each y in f(A), there
corresponds only one element x in A such that y = f(x). In particular, if f is
a one-to-one and onto function, then it is said to provide a one-to-one
correspondence between A and B. In this case, the sets A and B are said to
be equivalent. This fact is denoted by writing A ~ B.

Note that whenever A ~ B, there is a function $g: B \to A$ such that if
y = f(x), then x = g(y). The function g is called the inverse function of f and
is denoted by $f^{-1}$. It is easy to see that A ~ B defines an equivalence
relation. Properties 1 and 2 in Definition 1.3.2 are obviously true here. As for
property 3, if A, B, and C are sets such that A ~ B and B ~ C, then A ~ C.
To show this, let $f: A \to B$ and $h: B \to C$ be one-to-one and onto functions.
Then, the composite function $h \circ f$, where $(h \circ f)(x) = h[f(x)]$, defines a
one-to-one correspondence between A and C.
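For finite sets, these notions can be tested mechanically. The sketch below is one possible encoding, not a canonical one: a relation is a set of pairs, a function is a Python dictionary, and all helper names and the sets used are assumptions of this illustration.

```python
def is_function(rho):
    """Definition 1.3.3: no x is related to two different values."""
    images = {}
    for x, y in rho:
        # setdefault stores y for a new x and returns the stored value otherwise.
        if images.setdefault(x, y) != y:
            return False
    return True


def is_one_to_one(f):
    """Distinct arguments of the dictionary f must give distinct values."""
    return len(set(f.values())) == len(f)


def is_onto(f, B):
    """The range of f must be all of B."""
    return set(f.values()) == set(B)


def inverse(f):
    """Inverse of a one-to-one function given as a dictionary."""
    return {y: x for x, y in f.items()}


def compose(h, f):
    """Composite function: x is sent to h[f[x]]."""
    return {x: h[f[x]] for x in f}


# A finite analogue of a relation a = b^2: both (4, 2) and (4, -2) occur.
squares = {(b * b, b) for b in range(-2, 3)}
assert not is_function(squares)

# Hypothetical one-to-one correspondences f: A -> B and h: B -> C.
f = {1: "a", 2: "b", 3: "c"}
h = {"a": 10, "b": 20, "c": 30}
hf = compose(h, f)
assert is_one_to_one(hf) and is_onto(hf, {10, 20, 30})
assert inverse(f)["b"] == 2
```

The last two assertions mirror the argument for property 3: composing two one-to-one correspondences yields a one-to-one correspondence between A and C.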


EXAMPLE 1.3.1. The relation $a \rho b$, where a and b are real numbers such
that $a = b^2$, is not a function. This is true because both pairs (a, b) and
(a, -b) belong to $\rho$.


EXAMPLE 1.3.2. The relation $a \rho b$, where a and b are real numbers such
that $b = 2a^2 + 1$, is a function, since for each a, there is only one b that is
$\rho$-related to a.


EXAMPLE 1.3.3. Let $A = \{x \mid -1 \le x \le 1\}$, $B = \{x \mid 0 \le x \le 2\}$. Define
$f: A \to B$ such that $f(x) = x^2$. Here, f is a function, but is not one-to-one
because f(1) = f(-1) = 1. Also, f does not map A onto B, since y = 2 has no
corresponding x in A such that $x^2 = 2$.


EXAMPLE 1.3.4. Consider the relation $x \rho y$, where $y = \arcsin x$,
$-1 \le x \le 1$. Here, y is an angle measured in radians whose sine is x. Since there
are infinitely many angles with the same sine, $\rho$ is not a function. However, if
we restrict the range of y to the set $B = \{y \mid -\pi/2 \le y \le \pi/2\}$, then $\rho$
becomes a function, which is also one-to-one and onto. This function is the
inverse of the sine function x = sin y. We refer to the values of y that belong
to the set B as the principal values of arcsin x, which we denote by writing
y = Arcsin x. Note that other functions could have also been defined from
the arcsine relation. For example, if $\pi/2 \le y \le 3\pi/2$, then $x = \sin y = -\sin z$,
where $z = y - \pi$. Since $-\pi/2 \le z \le \pi/2$, then $z = -\mathrm{Arcsin}\, x$. Thus
$y = \pi - \mathrm{Arcsin}\, x$ maps the set $A = \{x \mid -1 \le x \le 1\}$ in a one-to-one manner onto
the set $C = \{y \mid \pi/2 \le y \le 3\pi/2\}$.


1.4. FINITE, COUNTABLE, AND UNCOUNTABLE SETS 


Let $J_n = \{1, 2, \ldots, n\}$ be a set consisting of the first n positive integers, and let
$J^+$ denote the set of all positive integers.


Definition 1.4.1. A set A is said to be:


1. Finite if $A \sim J_n$ for some positive integer n.

2. Countable if $A \sim J^+$. In this case, the set $J^+$, or any other set
equivalent to it, can be used as an index set for A, that is, the elements of A
are assigned distinct indices (subscripts) that belong to $J^+$. Hence,
A can be represented as $A = \{a_1, a_2, \ldots, a_n, \ldots\}$.

3. Uncountable if A is neither finite nor countable. In this case, the
elements of A cannot be indexed by $J_n$, for any n, or by $J^+$. □


EXAMPLE 1.4.1. Let $A = \{1, 4, 9, \ldots, n^2, \ldots\}$. This set is countable, since
the function $f: J^+ \to A$ defined by $f(n) = n^2$ is one-to-one and onto. Hence,
$A \sim J^+$.


EXAMPLE 1.4.2. Let A = J be the set of all integers. Then A is countable.
To show this, consider the function $f: J^+ \to A$ defined by


$f(n) = \begin{cases} (n+1)/2, & n \text{ odd}, \\ (2-n)/2, & n \text{ even}. \end{cases}$


It can be verified that f is one-to-one and onto. Hence, $A \sim J^+$.
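The verification can be explored numerically. The Python sketch below evaluates f on the first few positive integers and checks, over a finite range only, that no value is repeated and that a symmetric block of integers is covered. The bound 2001 is an arbitrary choice made for this illustration.

```python
def f(n):
    """Map a positive integer n to an integer, following Example 1.4.2."""
    if n % 2 == 1:
        return (n + 1) // 2   # n odd: 1, 2, 3, ...
    return (2 - n) // 2       # n even: 0, -1, -2, ...


print([f(n) for n in range(1, 8)])

# Finite check: the first 2001 values are distinct and fill a block of integers.
values = [f(n) for n in range(1, 2002)]
assert len(set(values)) == len(values)
assert set(values) == set(range(-999, 1002))
```

Integer division is exact here because n + 1 is even when n is odd and 2 - n is even when n is even.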


EXAMPLE 1.4.3. Let $A = \{x \mid 0 < x < 1\}$. This set is uncountable. To show
this, suppose that there exists a one-to-one correspondence between $J^+$ and
A. We can then write $A = \{a_1, a_2, \ldots, a_n, \ldots\}$. Let the digit in the nth decimal
place of $a_n$ be denoted by $b_n$ (n = 1, 2, ...). Define a number c as
$c = 0.c_1 c_2 \cdots c_n \cdots$ such that for each n, $c_n = 1$ if $b_n \neq 1$ and $c_n = 2$ if
$b_n = 1$. Now, c belongs to A, since 0 < c < 1. However, by construction, c is
different from every $a_i$ in at least one decimal digit (i = 1, 2, ...) and hence
$c \notin A$, which is a contradiction. Therefore, A is not countable. Since A is
not finite either, then it must be uncountable.
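The construction of c can be imitated on a finite list of decimal expansions. In the following Python sketch, the function name `diagonal_number` and the four digit strings are hypothetical; a finite list can only illustrate the rule for choosing the digits, not the uncountability itself.

```python
def diagonal_number(expansions):
    """Build c from digit strings; expansions[n] holds the decimals of a_(n+1)."""
    digits = []
    for n, a_n in enumerate(expansions):
        b_n = a_n[n]  # digit on the diagonal (Python indices start at 0)
        digits.append("1" if b_n != "1" else "2")
    return "0." + "".join(digits)


# Hypothetical decimal parts of a_1, ..., a_4 (four digits each).
expansions = ["1415", "7182", "4142", "5772"]
c = diagonal_number(expansions)
print(c)

# c differs from the nth number in the nth decimal place.
for n, a_n in enumerate(expansions):
    assert c[2 + n] != a_n[n]
```

Since every digit of c is 1 or 2, the ambiguity between expansions ending in repeated 9s and repeated 0s does not arise for c.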

This result implies that any subset of R, the set of real numbers, that 
contains A, or is equivalent to it, must be uncountable. In particular, R is 
uncountable. 


Theorem 1.4.1. Every infinite subset of a countable set is countable.


Proof. Let A be a countable set, and B be an infinite subset of A. Then
$A = \{a_1, a_2, \ldots, a_n, \ldots\}$, where the $a_i$'s are distinct elements. Let $n_1$ be the
smallest positive integer such that $a_{n_1} \in B$. Let $n_2 > n_1$ be the next smallest
integer such that $a_{n_2} \in B$. In general, if $n_1 < n_2 < \cdots < n_{k-1}$ have been
chosen, let $n_k$ be the smallest integer greater than $n_{k-1}$ such that $a_{n_k} \in B$.
Define the function $f: J^+ \to B$ such that $f(k) = a_{n_k}$, k = 1, 2, .... This
function is one-to-one and onto. Hence, B is countable. □


Theorem 1.4.2. The union of two countable sets is countable.


Proof. Let A and B be countable sets. Then they can be represented as
$A = \{a_1, a_2, \ldots, a_n, \ldots\}$, $B = \{b_1, b_2, \ldots, b_n, \ldots\}$. Define $C = A \cup B$. Consider
the following two cases:


i. A and B are disjoint.
ii. A and B are not disjoint.


In case i, let us write C as $C = \{a_1, b_1, a_2, b_2, \ldots, a_n, b_n, \ldots\}$. Consider the
function $f: J^+ \to C$ such that


$f(n) = \begin{cases} a_{(n+1)/2}, & n \text{ odd}, \\ b_{n/2}, & n \text{ even}. \end{cases}$


It can be verified that f is one-to-one and onto. Hence, C is countable.

Let us now consider case ii. If $A \cap B \neq \emptyset$, then some elements of C,
namely those in $A \cap B$, will appear twice. Hence, there exists a set $E \subset J^+$
such that E ~ C. Thus C is either finite or countable. Since $C \supset A$ and A is
infinite, C must be countable. □
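The interleaving used in the proof can be sketched with Python generators. The function `interleave` and the two example sequences (positive multiples of 2 and of 3) are assumptions made for this illustration; repeated elements are skipped, which corresponds to case ii.

```python
from itertools import count, islice


def interleave(a, b):
    """Yield a_1, b_1, a_2, b_2, ... from two infinite iterables, without repeats."""
    seen = set()
    for x, y in zip(a, b):
        for item in (x, y):
            if item not in seen:
                seen.add(item)
                yield item


def multiples(m):
    """Infinite sequence m, 2m, 3m, ... (a hypothetical countable set)."""
    return (m * k for k in count(1))


# The two sets are not disjoint: 6, 12, ... belong to both.
union = interleave(multiples(2), multiples(3))
print(list(islice(union, 8)))
```

Each element of the union appears after finitely many steps and receives exactly one position, which is the indexing by positive integers required for countability.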


Corollary 1.4.1. If $A_1, A_2, \ldots, A_n, \ldots$ are countable sets, then
$\bigcup_{i=1}^{\infty} A_i$ is countable.


Proof. The proof is left as an exercise. □


Theorem 1.4.3. Let A and B be two countable sets. Then their Cartesian
product $A \times B$ is countable.


Proof. Let us write A as $A = \{a_1, a_2, \ldots, a_n, \ldots\}$. For a given $a \in A$,
define (a, B) as the set


$(a, B) = \{(a, b) \mid b \in B\}$.


Then (a, B) ~ B and hence (a, B) is countable.
However,


$A \times B = \bigcup_{i=1}^{\infty} (a_i, B)$.


Thus by Corollary 1.4.1, $A \times B$ is countable. □


Corollary 1.4.2. If $A_1, A_2, \ldots, A_n$ are countable sets, then their
Cartesian product $\times_{i=1}^{n} A_i$ is countable.


Proof. The proof is left as an exercise. □


Corollary 1.4.3. The set Q of all rational numbers is countable.


Proof. By definition, a rational number is a number of the form m/n,
where m and n are integers with $n \neq 0$. Thus $Q \sim \tilde{Q}$, where


$\tilde{Q} = \{(m, n) \mid m, n \text{ are integers and } n \neq 0\}$.


Since $\tilde{Q}$ is an infinite subset of $J \times J$, where J is the set of all integers, which
is countable as was seen in Example 1.4.2, then by Theorems 1.4.1 and 1.4.3,
$\tilde{Q}$ is countable and so is Q. □
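One concrete way to list the rational numbers is sketched below in Python. It is only one of many possible enumerations and is not taken from the proof: integers are listed in the order of Example 1.4.2, pairs (m, n) are visited along diagonals, and pairs with n = 0 or with a value already produced are skipped. The names `integer_at` and `rationals` are hypothetical.

```python
from fractions import Fraction
from itertools import count, islice


def integer_at(k):
    """k-th integer (k = 1, 2, ...) in the order 1, 0, 2, -1, 3, ... of Example 1.4.2."""
    return (k + 1) // 2 if k % 2 == 1 else (2 - k) // 2


def rationals():
    """Yield every rational number exactly once."""
    seen = set()
    for s in count(2):                 # s = i + j labels a diagonal of index pairs
        for i in range(1, s):
            m, n = integer_at(i), integer_at(s - i)
            if n == 0:
                continue               # m/n is undefined for n = 0
            q = Fraction(m, n)         # Fraction reduces m/n to lowest terms
            if q not in seen:
                seen.add(q)
                yield q


print(list(islice(rationals(), 7)))
```

Skipping values that were already produced turns the listing of pairs into a listing of distinct rational numbers, so each rational receives exactly one positive integer index.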


REMARK 1.4.1. Any real number that cannot be expressed as a rational
number is called an irrational number. For example, $\sqrt{2}$ is an irrational
number. To show this, suppose that there exist integers, m and n, such that
$\sqrt{2} = m/n$. We may consider that m/n is written in its lowest terms, that is,
m and n have no common factors other than unity. In particular, m and n
cannot both be even. Now, $m^2 = 2n^2$. This implies that $m^2$ is even. Hence, m
is even and can therefore be written as $m = 2m'$. It follows that
$n^2 = m^2/2 = 2m'^2$. Consequently, $n^2$, and hence n, is even. This contradicts the fact that
m and n are not both even. Thus $\sqrt{2}$ must be an irrational number.


1.5. BOUNDED SETS


Let us consider the set R of real numbers.


Definition 1.5.1. A set $A \subset R$ is said to be:


1. Bounded from above if there exists a number q such that $x \le q$ for all
x in A. This number is called an upper bound of A.

2. Bounded from below if there exists a number p such that $x \ge p$ for all
x in A. The number p is called a lower bound of A.

3. Bounded if A has an upper bound q and a lower bound p. In this case,
there exists a nonnegative number r such that $-r \le x \le r$ for all x in
A. This number is equal to max(|p|, |q|). □


Definition 1.5.2. Let $A \subset R$ be a set bounded from above. If there exists
a number l that is an upper bound of A and is less than or equal to any
other upper bound of A, then l is called the least upper bound of A and is
denoted by lub(A). Another name for lub(A) is the supremum of A and is
denoted by sup(A). □


Definition 1.5.3. Let $A \subset R$ be a set bounded from below. If there exists
a number g that is a lower bound of A and is greater than or equal to any
other lower bound of A, then g is called the greatest lower bound and is
denoted by glb(A). The infimum of A, denoted by inf(A), is another name
for glb(A). □


The least upper bound of A, if it exists, is unique, but it may or may not 
belong to A. The same is true for glb(A). The proof of the following theorem 
is omitted and can be found in Rudin (1964, Theorem 1.36). 
Theorem 1.5.1. Let $A \subset R$ be a nonempty set.


1. If A is bounded from above, then lub(A) exists.
2. If A is bounded from below, then glb(A) exists.


EXAMPLE 1.5.1. Let $A = \{x \mid x < 0\}$. Then lub(A) = 0, which does not
belong to A.


EXAMPLE 1.5.2. Let $A = \{1/n \mid n = 1, 2, \ldots\}$. Then lub(A) = 1 and
glb(A) = 0. In this case, lub(A) belongs to A, but glb(A) does not.
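A numerical look at Example 1.5.2 is given in the following Python sketch, which works with the first N elements of A as exact fractions. The function name `truncated_set` and the values of N are chosen only for illustration, and a finite truncation can suggest, but not establish, the value of glb(A).

```python
from fractions import Fraction


def truncated_set(N):
    """First N elements of A = {1/n | n = 1, 2, ...}, as exact fractions."""
    return [Fraction(1, n) for n in range(1, N + 1)]


for N in (10, 1000):
    A_N = truncated_set(N)
    # The largest element is always 1; the smallest is 1/N, which is positive.
    print(N, max(A_N), min(A_N))
```

The maximum stays at 1, an element of A, while the minimum 1/N decreases toward 0 without ever reaching it, which is consistent with lub(A) = 1 belonging to A and glb(A) = 0 not belonging to A.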


1.6. SOME BASIC TOPOLOGICAL CONCEPTS 


The field of topology is an abstract study that evolved as an independent 
discipline in response to certain problems in classical analysis and geometry. 
It provides a unifying theory that can be used in many diverse branches of 
mathematics. In this section, we present a brief account of some basic 
definitions and results in the so-called point-set topology. 


Definition 1.6.1. Let A be a set, and let $\mathcal{F} = \{B_\alpha\}$ be a family of subsets
of A. Then $\mathcal{F}$ is a topology in A if it satisfies the following properties:


1. The union of any number of members of $\mathcal{F}$ is also a member of $\mathcal{F}$.


2. The intersection of a finite number of members of $\mathcal{F}$ is also a member
of $\mathcal{F}$.


3. Both A and the empty set Ø are members of $\mathcal{F}$. □
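When A is a finite set, the three properties can be checked by direct enumeration. The Python sketch below is an illustration under that finiteness assumption; the function name `is_topology` and the two example families are hypothetical.

```python
from itertools import combinations


def is_topology(A, family):
    """Check the properties of Definition 1.6.1 for a finite family of subsets of A."""
    A = frozenset(A)
    F = {frozenset(B) for B in family}
    if any(not B <= A for B in F):
        return False                      # every member must be a subset of A
    if A not in F or frozenset() not in F:
        return False                      # property 3
    # For a finite family, closure under pairwise unions and intersections
    # implies closure under all unions and all finite intersections.
    return all((U | V) in F and (U & V) in F for U, V in combinations(F, 2))


A = {1, 2, 3}
chain = [set(), {1}, {1, 2}, {1, 2, 3}]
not_closed = [set(), {1}, {2}, {1, 2, 3}]

print(is_topology(A, chain))       # nested sets: unions and intersections stay in the family
print(is_topology(A, not_closed))  # {1} union {2} = {1, 2} is missing
```

The members of a family that passes the check are the open sets of the resulting topological space.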


Definition 1.6.2. Let $\mathcal{F}$ be a topology in a set A. Then the pair $(A, \mathcal{F})$ is
called a topological space. □


Definition 1.6.3. Let $(A, \mathcal{F})$ be a topological space. Then the members of
$\mathcal{F}$ are called the open sets of the topology $\mathcal{F}$. □


Definition 1.6.4. Let $(A, \mathcal{F})$ be a topological space. A neighborhood of a
point $p \in A$ is any open set (that is, a member of $\mathcal{F}$) that contains p. In
particular, if A = R, the set of real numbers, then a neighborhood of $p \in R$
is an open set of the form $N_r(p) = \{q \mid |q - p| < r\}$ for some r > 0. □


Definition 1.6.5. Let $(A, \mathcal{F})$ be a topological space. A family
$\mathcal{G} = \{B_\alpha\} \subset \mathcal{F}$ is called a basis for $\mathcal{F}$ if each open set (that is,
member of $\mathcal{F}$) is the union of members of $\mathcal{G}$. □


On the basis of this definition, it is easy to prove the following theorem. 
Theorem 1.6.1. Let $(A, \mathcal{F})$ be a topological space, and let $\mathcal{G}$ be a basis
for $\mathcal{F}$. Then a set $B \subset A$ is open (that is, a member of $\mathcal{F}$) if and only if for
each $p \in B$, there is a $U \in \mathcal{G}$ such that $p \in U \subset B$.


For example, if A = R, then $\mathcal{G} = \{N_r(p) \mid p \in R,\ r > 0\}$ is a basis for the
topology in R. It follows that a set $B \subset R$ is open if for every point p in B,
there exists a neighborhood $N_r(p)$ such that $N_r(p) \subset B$.


Definition 1.6.6. Let $(A, \mathcal{F})$ be a topological space. A set $B \subset A$ is closed
if $\bar{B}$, the complement of B with respect to A, is an open set. □


It is easy to show that closed sets of a topological space $(A, \mathcal{F})$ satisfy the
following properties:


1. The intersection of any number of closed sets is closed. 
2. The union of a finite number of closed sets is closed. 
3. Both A and the empty set Ø are closed. 


Definition 1.6.7. Let $(A, \mathcal{F})$ be a topological space. A point $p \in A$ is said 
to be a limit point of a set $B \subset A$ if every neighborhood of p contains at least 
one element of B distinct from p. Thus, if U(p) is any neighborhood of p, 
then $U(p) \cap B$ is a nonempty set that contains at least one element besides 
p. In particular, if A = R, the set of real numbers, then p is a limit point of a 
set $B \subset R$ if for any r > 0, $N_r(p) \cap [B - \{p\}] \neq \emptyset$, where {p} denotes a set 
consisting of just p. □ 


Theorem 1.6.2. Let p be a limit point of a set $B \subset R$. Then every 
neighborhood of p contains infinitely many points of B. 


Proof. The proof is left to the reader. □ 


The next theorem is a fundamental theorem in set theory. It is originally 
due to Bernhard Bolzano (1781-1848), though its importance was first 
recognized by Karl Weierstrass (1815-1897). The proof is omitted and can be 
found, for example, in Zaring (1967, Theorem 4.62). 


Theorem 1.6.3 (Bolzano-Weierstrass). Every bounded infinite subset of 
R, the set of real numbers, has at least one limit point. 


Note that a limit point of a set B may not belong to B. For example, the 
set $B = \{1/n \mid n = 1, 2, \ldots\}$ has a limit point equal to zero, which does not 
belong to B. It can be seen here that any neighborhood of 0 contains 
infinitely many points of B. In particular, if r is a given positive number, then 
all elements of B of the form 1/n, where n > 1/r, belong to $N_r(0)$. From 
Theorem 1.6.2 it can also be concluded that a finite set cannot have limit 
points. 

Limit points can be used to describe closed sets, as can be seen from the 
following theorem. 


Theorem 1.6.4. A set B is closed if and only if every limit point of B 
belongs to B. 


Proof. Suppose that B is closed. Let p be a limit point of B. If $p \notin B$, 
then $p \in \bar{B}$, which is open. Hence, by Theorem 1.6.1, there exists a neighborhood 
U(p) of p contained inside $\bar{B}$. This means that $U(p) \cap B = \emptyset$, a 
contradiction, since p is a limit point of B (see Definition 1.6.7). Therefore, 
p must belong to B. Vice versa, if every limit point of B is in B, then B must 
be closed. To show this, let p be any point in $\bar{B}$. Then p is not a limit point 
of B. Therefore, there exists a neighborhood U(p) such that $U(p) \subset \bar{B}$. This 
means that $\bar{B}$ is open and hence B is closed. □ 


It should be noted that a set does not have to be either open or closed; if 
it is closed, it does not have to be open, and vice versa. Also, a set may be 
both open and closed. 


EXAMPLE 1.6.1. $B = \{x \mid 0 < x < 1\}$ is an open subset of R, but is not 
closed, since both 0 and 1 are limit points of B, but do not belong to it. 


EXAMPLE 1.6.2. $B = \{x \mid 0 \le x \le 1\}$ is closed, but is not open, since any 
neighborhood of 0 or 1 is not contained in B. 


EXAMPLE 1.6.3. $B = \{x \mid 0 < x \le 1\}$ is not open, because any neighborhood 
of 1 is not contained in B. It is also not closed, because 0 is a limit point that 
does not belong to B. 


EXAMPLE 1.6.4. The set R is both open and closed. 


EXAMPLE 1.6.5. A finite set is closed because it has no limit points, but is 
obviously not open. 


Definition 1.6.8. A subset B of a topological space $(A, \mathcal{F})$ is disconnected 
if there exist open subsets C and D of A such that $B \cap C$ and $B \cap D$ are 
disjoint nonempty sets whose union is B. A set is connected if it is not 
disconnected. □ 


The set of all rationals Q is disconnected, since $\{x \mid x > \sqrt{2}\} \cap Q$ and 
$\{x \mid x < \sqrt{2}\} \cap Q$ are disjoint nonempty sets whose union is Q. On the other 
hand, all intervals in R (open, closed, or half-open) are connected. 


Definition 1.6.9. A collection of sets $\{B_\alpha\}$ is said to be a covering of a set 
A if the union $\bigcup_\alpha B_\alpha$ contains A. If each $B_\alpha$ is an open set, then $\{B_\alpha\}$ is 
called an open covering. □ 


Definition 1.6.10. A set A in a topological space is compact if each open 
covering $\{B_\alpha\}$ of A has a finite subcovering, that is, there is a finite 
subcollection $B_{\alpha_1}, B_{\alpha_2}, \ldots, B_{\alpha_n}$ of $\{B_\alpha\}$ such that $A \subset \bigcup_{i=1}^{n} B_{\alpha_i}$. □ 


The concept of compactness is motivated by the classical Heine–Borel 
theorem, which characterizes compact sets in R, the set of real numbers, as 
closed and bounded sets. 


Theorem 1.6.5 (Heine–Borel). A set $B \subset R$ is compact if and only if it is 
closed and bounded. 


Proof. See, for example, Zaring (1967, Theorem 4.78). □ 


Thus, according to the Heine–Borel theorem, every closed and bounded 
interval [a, b] is compact. 


1.7. EXAMPLES IN PROBABILITY AND STATISTICS 


EXAMPLE 1.7.1. In probability theory, events are considered as subsets in 
a sample space $\Omega$, which consists of all the possible outcomes of an experiment. 
A Borel field of events (also called a $\sigma$-field) in $\Omega$ is a collection $\mathcal{B}$ of 
events with the following properties: 


i. $\Omega \in \mathcal{B}$. 
ii. If $E \in \mathcal{B}$, then $\bar{E} \in \mathcal{B}$, where $\bar{E}$ is the complement of E. 
iii. If $E_1, E_2, \ldots, E_n, \ldots$ is a countable collection of events in $\mathcal{B}$, then 
$\bigcup_{i=1}^{\infty} E_i$ belongs to $\mathcal{B}$. 


The probability of an event E is a number denoted by P(E) that has the 
following properties: 


i. $0 \le P(E) \le 1$. 
ii. $P(\Omega) = 1$. 
iii. If $E_1, E_2, \ldots, E_n, \ldots$ is a countable collection of disjoint events in $\mathcal{B}$, 
then 

$$P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i).$$

By definition, the triple $(\Omega, \mathcal{B}, P)$ is called a probability space. 
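
The defining properties of a probability space can be checked directly on a small finite example. The sketch below is illustrative only: the experiment (one roll of a fair die), the choice of the full power set as the collection of events, and the names `power_set`, `P`, and `is_sigma_field` are all assumptions made for this example. For a finite collection, closure under pairwise unions is enough to give closure under countable unions.

```python
from fractions import Fraction
from itertools import chain, combinations

# Hypothetical experiment: one roll of a fair six-sided die.
omega = frozenset(range(1, 7))

def power_set(s):
    """Return all subsets of s as frozensets."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, k) for k in range(len(items) + 1))
    return {frozenset(c) for c in subsets}

events = power_set(omega)  # plays the role of the sigma-field of events

def P(E):
    """Equally likely outcomes: P(E) = |E| / |omega|."""
    return Fraction(len(E), len(omega))

def is_sigma_field(events, omega):
    """Check properties i-iii of a sigma-field for a finite collection."""
    contains_omega = omega in events
    closed_complement = all(omega - E in events for E in events)
    closed_union = all(E1 | E2 in events for E1 in events for E2 in events)
    return contains_omega and closed_complement and closed_union

print(is_sigma_field(events, omega))  # True

# Additivity for two disjoint events.
even, low_odd = frozenset({2, 4, 6}), frozenset({1, 3})
print(P(even | low_odd) == P(even) + P(low_odd))  # True
```

Exact rational arithmetic (`Fraction`) is used so that the additivity check is not affected by floating-point rounding.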

EXAMPLE 1.7.2. A random variable X defined on a probability space 
$(\Omega, \mathcal{B}, P)$ is a function $X: \Omega \to A$, where A is a nonempty set of real 
numbers. For any real number x, the set $E = \{\omega \in \Omega \mid X(\omega) \le x\}$ is an 
element of $\mathcal{B}$. The probability of the event E is called the cumulative 
distribution function of X and is denoted by F(x). In statistics, it is customary 
to write just X instead of $X(\omega)$. We thus have 

$$F(x) = P(X \le x).$$

This concept can be extended to several random variables: Let $X_1, X_2, \ldots, X_n$ 
be n random variables. Define the event $A_i = \{\omega \in \Omega \mid X_i(\omega) \le x_i\}$, 
$i = 1, 2, \ldots, n$. Then $P\left(\bigcap_{i=1}^{n} A_i\right)$, which can be expressed as 

$$F(x_1, x_2, \ldots, x_n) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n),$$

is called the joint cumulative distribution function of $X_1, X_2, \ldots, X_n$. In this 
case, the n-tuple $(X_1, X_2, \ldots, X_n)$ is said to have a multivariate distribution. 

A random variable X is said to be discrete, or to have a discrete 
distribution, if its range is finite or countable. For example, the binomial 
random variable is discrete. It represents the number of successes in a 
sequence of n independent trials, in each of which there are two possible 
outcomes: success or failure. The probability of success, denoted by $p_n$, is the 
same in all the trials. Such a sequence of trials is called a Bernoulli sequence. 
Thus the possible values of this random variable are 0, 1, ..., n. 

Another example of a discrete random variable is the Poisson, whose 
possible values are 0, 1, 2, .... It is considered to be the limit of a binomial 
random variable as $n \to \infty$ in such a way that $np_n \to \lambda > 0$. Other examples of 
discrete random variables include the discrete uniform, geometric, hypergeo- 
metric, and negative binomial (see, for example, Fisz, 1963; Johnson and 
Kotz, 1969; Lindgren 1976; Lloyd, 1980). 
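
The limiting relationship between the binomial and Poisson distributions can be seen numerically. The following sketch assumes the standard probability mass functions of the two distributions (they are not written out in the text above), and the value $\lambda = 2$ and the helper names are illustrative choices. With $p_n = \lambda/n$, the largest difference between the two mass functions over k = 0, 1, ..., 10 shrinks as n grows.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials, success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with parameter lam."""
    return exp(-lam) * lam**k / factorial(k)

def max_pmf_gap(n, lam, kmax=10):
    """Largest absolute difference between the two pmfs for k = 0..kmax, with p_n = lam / n."""
    p_n = lam / n
    return max(abs(binomial_pmf(k, n, p_n) - poisson_pmf(k, lam))
               for k in range(kmax + 1))

lam = 2.0  # illustrative value of the Poisson parameter
for n in (10, 100, 1000, 10000):
    print(n, max_pmf_gap(n, lam))
```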

A random variable X is said to be continuous, or to have a continuous 
distribution, if its range is an uncountable set, for example, an interval. In 
this case, the cumulative distribution function F(x) of X is a continuous 
function of x on the set R of all real numbers. If, in addition, F(x) is 
differentiable, then its derivative is called the density function of X. One of 
the best-known continuous distributions is the normal. A number of continuous 
distributions are derived in connection with it, for example, the chi-squared, 
F, Rayleigh, and t distributions. Other well-known continuous 
distributions include the beta, continuous uniform, exponential, and gamma 
distributions (see, for example, Fisz, 1963; Johnson and Kotz, 1970a, b). 


EXAMPLE 1.7.3. Let $f(x, \theta)$ denote the density function of a continuous 
random variable X, where $\theta$ represents a set of unknown parameters that 
identify the distribution of X. The range of X, which consists of all possible 
values of X, is referred to as a population and denoted by $P_X$. Any subset of 
n elements from $P_X$ forms a sample of size n. This sample is actually an 
element in the Cartesian product $P_X^n$. Any real-valued function defined on 
$P_X^n$ is called a statistic. We denote such a function by $g(X_1, X_2, \ldots, X_n)$, 
where each $X_i$ has the same distribution as X. Note that this function is a 
random variable whose values do not depend on $\theta$. For example, the sample 
mean $\bar{X} = \sum_{i=1}^{n} X_i / n$ and the sample variance $S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)$ 
are statistics. We adopt the convention that whenever a particular sample of 
size n is chosen (or observed) from $P_X$, the elements in that sample are 
written using lowercase letters, for example, $x_1, x_2, \ldots, x_n$. The corresponding 
value of a statistic is written as $g(x_1, x_2, \ldots, x_n)$. 
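
As a small illustration of the lowercase convention, the values of the sample mean and sample variance can be computed for an observed sample. The data below are hypothetical and the function names are illustrative; the formulas are the ones given in the example, with divisor n - 1 for the variance.

```python
def sample_mean(xs):
    """Value of the sample mean for observed values x_1, ..., x_n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Value of the sample variance, using the divisor n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

x = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical observed sample of size n = 5
print(sample_mean(x), sample_variance(x))
```

Neither function takes the parameter of the distribution as an argument, which reflects the requirement that a statistic does not depend on it.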

EXAMPLE 1.7.4. Two random variables, X and Y, are said to be equal in 
distribution if they have the same cumulative distribution function. This fact 
is denoted by writing $X \stackrel{d}{=} Y$. The same definition applies to random variables 
with multivariate distributions. We note that $\stackrel{d}{=}$ is an equivalence relation, 
since it satisfies properties 1, 2, and 3 in Definition 1.3.2. The first two 
properties are obviously true. As for property 3, if $X \stackrel{d}{=} Y$ and $Y \stackrel{d}{=} Z$, then 
all three random variables have the same cumulative distribution function, 
which implies that $X \stackrel{d}{=} Z$. This equivalence relation is useful in nonparametric 
statistics (see Randles and Wolfe, 1979). For example, it can be shown 
that if X has a distribution that is symmetric about some number $\mu$, then 
$X - \mu \stackrel{d}{=} \mu - X$. Also, if $X_1, X_2, \ldots, X_n$ are independent and identically 
distributed random variables, and if $(m_1, m_2, \ldots, m_n)$ is any permutation of the 
n-tuple $(1, 2, \ldots, n)$, then $(X_1, X_2, \ldots, X_n) \stackrel{d}{=} (X_{m_1}, X_{m_2}, \ldots, X_{m_n})$. In this case, 
we say that the collection of random variables $X_1, X_2, \ldots, X_n$ is exchangeable. 


EXAMPLE 1.7.5. Consider the problem of testing the null hypothesis $H_0$: 
$\theta \le \theta_0$ versus the alternative hypothesis $H_a$: $\theta > \theta_0$, where $\theta$ is some unknown 
parameter that belongs to a set A. Let T be a statistic used in making 
a decision as to whether $H_0$ should be rejected or not. This statistic is 
appropriately called a test statistic. 

Suppose that $H_0$ is rejected if T > t, where t is some real number. Since 
the distribution of T depends on $\theta$, then the probability P(T > t) is a 
function of $\theta$, which we denote by $\pi(\theta)$. Thus $\pi: A \to [0, 1]$. Let $B_0$ be a 
subset of A defined as $B_0 = \{\theta \in A \mid \theta \le \theta_0\}$. By definition, the size of the test 
is the least upper bound of the set $\pi(B_0)$. This probability is denoted by $\alpha$ 
and is also called the level of significance of the test. We thus have 

$$\alpha = \sup_{\theta \le \theta_0} \pi(\theta).$$

To learn more about the above examples and others, the interested reader 
may consider consulting some of the references listed in the annotated 
bibliography. 


EXERCISES 

In Mathematics 


1.1. Verify Results 1.2.3-1.2.12. 
1.2. Verify Results 1.2.13-1.2.16. 


1.3. Let A, B, and C be sets such that $A \cap B \subset C$ and $A \cup C \subset B$. Show 
that A and C are disjoint. 


1.4. Let A, B, and C be sets such that $C = (A - B) \cup (B - A)$. The set C is 
called the symmetric difference of A and B and is denoted by $A \triangle B$. 
Show that 


(a) $A \triangle B = A \cup B - A \cap B$. 
(b) $A \triangle (B \triangle D) = (A \triangle B) \triangle D$, where D is any set. 
(c) $A \cap (B \triangle D) = (A \cap B) \triangle (A \cap D)$, where D is any set. 


1.5. Let $A = J^+ \times J^+$, where $J^+$ is the set of positive integers. Define a 
relation $\rho$ in A as follows: If $(m_1, n_1)$ and $(m_2, n_2)$ are elements in A, 
then $(m_1, n_1) \, \rho \, (m_2, n_2)$ if $m_1 n_2 = n_1 m_2$. Show that $\rho$ is an equivalence 
relation and describe its equivalence classes. 


1.6. Let A be the same set as in Exercise 1.5. Show that the following 
relation is an equivalence relation: $(m_1, n_1) \, \rho \, (m_2, n_2)$ if $m_1 + n_2 = n_1 + m_2$. 
Draw the equivalence class of (1, 2). 


1.7. Consider the set A = {(—2,— 5), (—1,— 3), (1,2), G, 10)}. Show that A 
defines a function. 


1.8. Let A and B be two sets and f be a function defined on A such that 
$f(A) \subset B$. If $A_1, A_2, \ldots, A_n$ are subsets of A, then show that: 

(a) $f\left(\bigcup_{i=1}^{n} A_i\right) = \bigcup_{i=1}^{n} f(A_i)$. 
(b) $f\left(\bigcap_{i=1}^{n} A_i\right) \subset \bigcap_{i=1}^{n} f(A_i)$. 

Under what conditions are the two sides in (b) equal? 


1.9. Prove Corollary 1.4.1. 


1.10. Prove Corollary 1.4.2. 


1.11. Show that the set A = {3, 9, 19, 33, 51, 73, ...} is countable. 


1.12. Show that $\sqrt{3}$ is an irrational number. 


1.13. Let a, b, c, and d be rational numbers such that $a + \sqrt{b} = c + \sqrt{d}$. 
Then, either 

(a) a = c, b = d, or 
(b) b and d are both squares of rational numbers. 


1.14. Let $A \subset R$ be a nonempty set bounded from below. Define $-A$ to be 
the set $\{-x \mid x \in A\}$. Show that $\inf(A) = -\sup(-A)$. 


1.15. Let $A \subset R$ be a closed and bounded set, and let $\sup(A) = b$. Show that 
$b \in A$. 


1.16. Prove Theorem 1.6.2. 


1.17. Let $(A, \mathcal{F})$ be a topological space. Show that $G \subset \mathcal{F}$ is a basis for $\mathcal{F}$ if 
and only if for each $B \in \mathcal{F}$ and each $p \in B$, there is a $U \in G$ such that 
$p \in U \subset B$. 


1.18. Show that if A and B are closed sets, then $A \cup B$ is a closed set. 


1.19. Let $B \subset A$ be a closed subset of a compact set A. Show that B is 
compact. 


1.20. Is a compact subset of a compact set necessarily closed? 


In Statistics 


1.21. Let X be a random variable. Consider the following events: 

$A_n = \{\omega \in \Omega \mid X(\omega) < x + 3^{-n}\}$, n = 1, 2, ..., 

$B_n = \{\omega \in \Omega \mid X(\omega) \le x - 3^{-n}\}$, n = 1, 2, ..., 

$A = \{\omega \in \Omega \mid X(\omega) \le x\}$, 

$B = \{\omega \in \Omega \mid X(\omega) < x\}$, 

where x is a real number. Show that for any x, 
(a) $\bigcap_{n=1}^{\infty} A_n = A$; 
(b) $\bigcup_{n=1}^{\infty} B_n = B$. 


1.22. Let X be a nonnegative random variable such that $E(X) = \mu$ is finite, 
where E(X) denotes the expected value of X. The following inequality, 
known as Markov's inequality, is true: 

$$P(X \ge h) \le \frac{\mu}{h},$$

where h is any positive number. Consider now a Poisson random 
variable with parameter $\lambda$. 

(a) Find an upper bound on the probability $P(X \ge 2)$ using Markov's 
inequality. 

(b) Obtain the exact probability value in (a), and demonstrate that it is 
smaller than the corresponding upper bound in Markov's inequality. 


1.23. Let X be a random variable whose expected value $\mu$ and variance $\sigma^2$ 
exist. Show that for any positive constants c and k, 

(a) $P(|X - \mu| \ge c) \le \sigma^2 / c^2$, 

(b) $P(|X - \mu| \ge k\sigma) \le 1 / k^2$, 

(c) $P(|X - \mu| < k\sigma) \ge 1 - 1 / k^2$. 

The preceding three inequalities are equivalent versions of the so-called 
Chebyshev's inequality. 


1.24. Let X be a continuous random variable with the density function 

ORP 

STII, 
elsewhere. 

By definition, the density function of X is a nonnegative function such 
that $F(x) = \int_{-\infty}^{x} f(t) \, dt$, where F(x) is the cumulative distribution function 
of X. 

(a) Apply Markov's inequality to finding upper bounds on the following 
probabilities: (i) P(|X| = 4); (ii) P(|X| > 4). 

(b) Compute the exact value of P(|X| > 4), and compare it against the 
upper bound in (a)(i). 


1.25. Let $X_1, X_2, \ldots, X_n$ be n continuous random variables. Define the 
random variables $X_{(1)}$ and $X_{(n)}$ as 

$$X_{(1)} = \min_{1 \le i \le n} \{X_1, X_2, \ldots, X_n\},$$

$$X_{(n)} = \max_{1 \le i \le n} \{X_1, X_2, \ldots, X_n\}.$$

Show that for any x, 

(a) $P(X_{(1)} \ge x) = P(X_1 \ge x, X_2 \ge x, \ldots, X_n \ge x)$, 

(b) $P(X_{(n)} \le x) = P(X_1 \le x, X_2 \le x, \ldots, X_n \le x)$. 

In particular, if $X_1, X_2, \ldots, X_n$ form a sample of size n from a 
population with a cumulative distribution function F(x), show that 

(c) $P(X_{(1)} \le x) = 1 - [1 - F(x)]^n$, 

(d) $P(X_{(n)} \le x) = [F(x)]^n$. 

The statistics $X_{(1)}$ and $X_{(n)}$ are called the first-order and nth-order 
statistics, respectively. 


1.26. Suppose that we have a sample of size n = 5 from a population with an 
exponential distribution whose density function is 

$$f(x) = \begin{cases} 2e^{-2x}, & x > 0, \\ 0, & \text{elsewhere.} \end{cases}$$

Find the value of $P(2 \le X_{(1)} \le 3)$. 


CHAPTER 2 


Basic Concepts in Linear Algebra 


In this chapter we present some fundamental concepts concerning vector 
spaces and matrix algebra. The purpose of the chapter is to familiarize the 
reader with these concepts, since they are essential to the understanding of 
some of the remaining chapters. For this reason, most of the theorems in this 
chapter will be stated without proofs. There are several excellent books on 
linear algebra that can be used for a more detailed study of this subject (see 
the bibliography at the end of this chapter). 

In statistics, matrix algebra is used quite extensively, especially in linear 
models and multivariate analysis. The books by Basilevsky (1983), Graybill 
(1983), Magnus and Neudecker (1988), and Searle (1982) include many 
applications of matrices in these areas. 

In this chapter, as well as in the remainder of the book, elements of the 
set of real numbers, R, are sometimes referred to as scalars. The Cartesian 
product $\times_{i=1}^{n} R$ is denoted by $R^n$, which is also known as the n-dimensional 
Euclidean space. Unless otherwise stated, all matrix elements are considered 
to be real numbers. 


2.1. VECTOR SPACES AND SUBSPACES 


A vector space over R is a set V of elements called vectors together with two 
operations, addition and scalar multiplication, that satisfy the following 
conditions: 


1. u + v is an element of V for all u, v in V. 
2. If $\alpha$ is a scalar and $u \in V$, then $\alpha u \in V$. 
3. u + v = v + u for all u, v in V. 
4. u + (v + w) = (u + v) + w for all u, v, w in V. 
5. There exists an element $0 \in V$ such that 0 + u = u for all u in V. This 
element is called the zero vector. 
6. For each $u \in V$ there exists a $v \in V$ such that u + v = 0. 
7. $\alpha(u + v) = \alpha u + \alpha v$ for any scalar $\alpha$ and any u and v in V. 
8. $(\alpha + \beta)u = \alpha u + \beta u$ for any scalars $\alpha$ and $\beta$ and any u in V. 
9. $\alpha(\beta u) = (\alpha\beta)u$ for any scalars $\alpha$ and $\beta$ and any u in V. 
10. 1u = u for any $u \in V$. 


EXAMPLE 2.1.1. A familiar example of a vector space is the n-dimensional 
Euclidean space $R^n$. Here, addition and multiplication are defined as 
follows: If $(u_1, u_2, \ldots, u_n)$ and $(v_1, v_2, \ldots, v_n)$ are two elements in $R^n$, then 
their sum is defined as $(u_1 + v_1, u_2 + v_2, \ldots, u_n + v_n)$. If $\alpha$ is a scalar, then 
$\alpha(u_1, u_2, \ldots, u_n) = (\alpha u_1, \alpha u_2, \ldots, \alpha u_n)$. 


EXAMPLE 2.1.2. Let V be the set of all polynomials in x of degree less 
than or equal to k. Then V is a vector space. Any element in V can be 
expressed as $\sum_{i=0}^{k} a_i x^i$, where the $a_i$'s are scalars. 


EXAMPLE 2.1.3. Let V be the set of all functions defined on the closed 
interval [— 1,1]. Then V is a vector space. It can be seen that f(x) + g(x) and 
af(x) belong to V, where f(x) and g(x) are elements in V and a is any 
scalar. 


EXAMPLE 2.1.4. The set V of all nonnegative functions defined on [-1, 1] 
is not a vector space, since if $f(x) \in V$ and $\alpha$ is a negative scalar, then 
$\alpha f(x) \notin V$. 


EXAMPLE 2.1.5. Let V be the set of all points (x, y) on a straight line 
given by the equation 2x - y + 1 = 0. Then V is not a vector space. This is 
because if $(x_1, y_1)$ and $(x_2, y_2)$ belong to V, then $(x_1 + x_2, y_1 + y_2) \notin V$, since 
$2(x_1 + x_2) - (y_1 + y_2) + 1 = -1 \neq 0$. Alternatively, we can state that V is not 
a vector space because the zero element (0, 0) does not belong to V. This 
violates condition 5 for a vector space. 


A subset W of a vector space V is said to form a vector subspace if W 
itself is a vector space. Equivalently, W is a subspace if whenever $u, v \in W$ 
and $\alpha$ is a scalar, then $u + v \in W$ and $\alpha u \in W$. For example, the set W of all 
continuous functions defined on [-1, 1] is a vector subspace of V in Example 
2.1.3. Also, the set of all points on the straight line y - 2x = 0 is a vector 
subspace of $R^2$. However, the points on any straight line in $R^2$ not going 
through the origin (0, 0) do not form a vector subspace, as was seen in 
Example 2.1.5. 


Definition 2.1.1. Let V be a vector space, and $u_1, u_2, \ldots, u_n$ be a collection 
of n elements in V. These elements are said to be linearly dependent if 
there exist n scalars $\alpha_1, \alpha_2, \ldots, \alpha_n$, not all equal to zero, such that 
$\sum_{i=1}^{n} \alpha_i u_i = 0$. If, however, $\sum_{i=1}^{n} \alpha_i u_i = 0$ is true only when all the $\alpha_i$'s are zero, then 
$u_1, u_2, \ldots, u_n$ are linearly independent. It should be noted that if $u_1, u_2, \ldots, u_n$ 
are linearly independent, then none of them can be zero. If, for example, 
$u_1 = 0$, then $\alpha u_1 + 0u_2 + \cdots + 0u_n = 0$ for any $\alpha \neq 0$, which implies that the 
$u_i$'s are linearly dependent, a contradiction. □ 


From the preceding definition we can say that a collection of n elements 
in a vector space are linearly dependent if at least one element in this 
collection can be expressed as a linear combination of the remaining n — 1 
elements. If no element, however, can be expressed in this fashion, then the 
n elements are linearly independent. For example, in $R^3$, (1, 2, -2), (-1, 0, 3), 
and (1, 4, -1) are linearly dependent, since 2(1, 2, -2) + (-1, 0, 3) - 
(1, 4, -1) = 0. On the other hand, it can be verified that (1, 1, 0), (1, 0, 2), and 
(0,1,3) are linearly independent. 
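
One way to carry out the verification mentioned above is sketched below. It relies on a standard fact that is not stated in this section and is used here as an assumption: three vectors in $R^3$ are linearly independent if and only if the determinant of the matrix having them as rows is nonzero. The helper name `det3` is illustrative.

```python
def det3(r1, r2, r3):
    """Determinant of the 3 x 3 matrix with rows r1, r2, r3 (cofactor expansion)."""
    (a, b, c), (d, e, f), (g, h, i) = r1, r2, r3
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

dependent = [(1, 2, -2), (-1, 0, 3), (1, 4, -1)]
independent = [(1, 1, 0), (1, 0, 2), (0, 1, 3)]

print(det3(*dependent))    # 0, so the vectors are linearly dependent
print(det3(*independent))  # -5, nonzero, so the vectors are linearly independent

# Direct check of the dependence relation quoted in the text.
u1, u2, u3 = dependent
print(tuple(2 * a + b - c for a, b, c in zip(u1, u2, u3)))  # (0, 0, 0)
```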


Definition 2.1.2. Let $u_1, u_2, \ldots, u_n$ be n elements in a vector space V. 
The collection of all linear combinations of the form $\sum_{i=1}^{n} \alpha_i u_i$, where the $\alpha_i$'s 
are scalars, is called a linear span of $u_1, u_2, \ldots, u_n$ and is denoted by 
$L(u_1, u_2, \ldots, u_n)$. □ 


It is easy to see from the preceding definition that $L(u_1, u_2, \ldots, u_n)$ is a 
vector subspace of V. This vector subspace is said to be spanned by 
$u_1, u_2, \ldots, u_n$. 


Definition 2.1.3. Let V be a vector space. If there exist linearly independent 
elements $u_1, u_2, \ldots, u_n$ in V such that $V = L(u_1, u_2, \ldots, u_n)$, then 
$u_1, u_2, \ldots, u_n$ are said to form a basis for V. The number n of elements in 
this basis is called the dimension of the vector space and is denoted by dim V. □ 


Note that a basis for a vector space is not unique. However, its dimension 
is unique. For example, the three vectors (1, 0, 0), (0, 1, 0), and (0, 0, 1) form a 
basis for $R^3$. Another basis for $R^3$ consists of (1, 1, 0), (1, 0, 1), and (0, 1, 1). 

If $u_1, u_2, \ldots, u_n$ form a basis for V and if u is a given element in V, then 
there exists a unique set of scalars, $\alpha_1, \alpha_2, \ldots, \alpha_n$, such that $u = \sum_{i=1}^{n} \alpha_i u_i$. To 
show this, suppose that there exists another set of scalars, $\beta_1, \beta_2, \ldots, \beta_n$, 
such that $u = \sum_{i=1}^{n} \beta_i u_i$. Then $\sum_{i=1}^{n} (\alpha_i - \beta_i) u_i = 0$, which implies that $\alpha_i = \beta_i$ 
for all i, since the $u_i$'s are linearly independent. 

Let us now check the dimensions of the vector spaces for some of the 
examples described earlier. For Example 2.1.1, dim V = n. In Example 2.1.2, 
$\{1, x, x^2, \ldots, x^k\}$ is a basis for V; hence dim V = k + 1. As for Example 2.1.3, 
dim V is infinite, since there is no finite set of functions that can span V. 


Definition 2.1.4. Let u and v be two vectors in $R^n$. The dot product (also 
called scalar product or inner product) of u and v is a scalar denoted by $u \cdot v$ 
and is given by 

$$u \cdot v = \sum_{i=1}^{n} u_i v_i,$$

where $u_i$ and $v_i$ are the ith components of u and v, respectively (i = 
1, 2, ..., n). In particular, if u = v, then $(u \cdot u)^{1/2} = \left(\sum_{i=1}^{n} u_i^2\right)^{1/2}$ is called the 
Euclidean norm (or length) of u and is denoted by $\|u\|_2$. The dot product of 
u and v is also equal to $\|u\|_2 \|v\|_2 \cos\theta$, where $\theta$ is the angle between u and v. □ 


Definition 2.1.5. Two vectors u and v in $R^n$ are said to be orthogonal if 
their dot product is zero. □ 


Definition 2.1.6. Let U be a vector subspace of $R^n$. The vectors 
$e_1, e_2, \ldots, e_m$ form an orthonormal basis for U if they satisfy the following 
properties: 

1. $e_1, e_2, \ldots, e_m$ form a basis for U. 

2. $e_i \cdot e_j = 0$ for all $i \neq j$ (i, j = 1, 2, ..., m). 

3. $\|e_i\|_2 = 1$ for i = 1, 2, ..., m. 


Any collection of vectors satisfying just properties 2 and 3 is said to be 
orthonormal. □ 


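Theorem 2.1.1. Let $u_1, u_2, \ldots, u_m$ be a basis for a vector subspace U of 
$R^n$. Then there exists an orthonormal basis, $e_1, e_2, \ldots, e_m$, for U, given by 

$$e_1 = \frac{v_1}{\|v_1\|_2}, \quad \text{where } v_1 = u_1,$$

$$e_2 = \frac{v_2}{\|v_2\|_2}, \quad \text{where } v_2 = u_2 - \frac{v_1 \cdot u_2}{\|v_1\|_2^2} v_1,$$

$$\vdots$$

$$e_m = \frac{v_m}{\|v_m\|_2}, \quad \text{where } v_m = u_m - \sum_{j=1}^{m-1} \frac{v_j \cdot u_m}{\|v_j\|_2^2} v_j.$$

Proof. See Graybill (1983, Theorem 2.6.5). □ 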


The procedure of constructing an orthonormal basis from any given basis 
as described in Theorem 2.1.1 is known as the Gram-Schmidt orthonormal- 
ization procedure. 
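
The Gram-Schmidt procedure translates directly into a short routine. The following is a minimal sketch in plain Python; the function names `dot` and `gram_schmidt` are illustrative, and the input is assumed to be a list of linearly independent vectors (no check is made for a zero $v_i$, which would arise from a dependent input). The basis (1, 1, 0), (1, 0, 1), (0, 1, 1) of $R^3$ mentioned earlier in this section is used as input.

```python
from math import sqrt

def dot(u, v):
    """Dot product of two vectors given as sequences of equal length."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(basis):
    """Return an orthonormal basis e_1, ..., e_m built from u_1, ..., u_m.

    Each v_i is u_i minus its components along the earlier v_j,
    and e_i is v_i divided by its Euclidean norm.
    """
    vs = []
    for u in basis:
        v = [float(x) for x in u]
        for w in vs:
            coef = dot(w, u) / dot(w, w)  # (v_j . u_i) / ||v_j||^2
            v = [vi - coef * wi for vi, wi in zip(v, w)]
        vs.append(v)
    return [[x / sqrt(dot(v, v)) for x in v] for v in vs]

es = gram_schmidt([(1, 1, 0), (1, 0, 1), (0, 1, 1)])
# Matrix of pairwise dot products; it should be close to the identity.
for ei in es:
    print([round(dot(ei, ej), 10) for ej in es])
```

In floating-point arithmetic the dot products of distinct vectors are only approximately zero, which is why the printed values are rounded.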


Theorem 2.1.2. Let u and v be two vectors in $R^n$. Then: 


1. $|u \cdot v| \le \|u\|_2 \|v\|_2$. 
2. $\|u + v\|_2 \le \|u\|_2 + \|v\|_2$. 


Proof. See Marcus and Minc (1988, Theorem 3.4). □ 


The inequality in part 1 of Theorem 2.1.2 is known as the Cauchy–Schwarz 
inequality. The one in part 2 is called the triangle inequality. 


Definition 2.1.7. Let U be a vector subspace of $R^n$. The orthogonal 
complement of U, denoted by $U^\perp$, is the vector subspace of $R^n$ which 
consists of all vectors v such that $u \cdot v = 0$ for all u in U. □ 


Definition 2.1.8. Let $U_1, U_2, \ldots, U_n$ be vector subspaces of the vector 
space U. The direct sum of these vector subspaces, denoted by $\bigoplus_{i=1}^{n} U_i$, 
consists of all vectors u that can be uniquely expressed as $u = \sum_{i=1}^{n} u_i$, where 
$u_i \in U_i$, i = 1, 2, ..., n. □ 


Theorem 2.1.3. Let $U_1, U_2, \ldots, U_n$ be vector subspaces of the vector space 
U. Then: 


1. $\bigoplus_{i=1}^{n} U_i$ is a vector subspace of U. 
2. If $U = \bigoplus_{i=1}^{n} U_i$, then $\bigcap_{i=1}^{n} U_i$ consists of just the zero element 0 of U. 
3. $\dim \bigoplus_{i=1}^{n} U_i = \sum_{i=1}^{n} \dim U_i$. 


Proof. The proof is left as an exercise. □ 


Theorem 2.1.4. Let U be a vector subspace of $R^n$. Then $R^n = U \oplus U^\perp$. 


Proof. See Marcus and Minc (1988, Theorem 3.3). □ 


From Theorem 2.1.4 we conclude that any $v \in R^n$ can be uniquely written 
as $v = v_1 + v_2$, where $v_1 \in U$ and $v_2 \in U^\perp$. In this case, $v_1$ and $v_2$ are called 
the projections of v on U and $U^\perp$, respectively. 


2.2. LINEAR TRANSFORMATIONS 


Let U and V be two vector spaces. A function $T: U \to V$ is called a linear 
transformation if $T(\alpha_1 u_1 + \alpha_2 u_2) = \alpha_1 T(u_1) + \alpha_2 T(u_2)$ for all $u_1, u_2$ in U 
and any scalars $\alpha_1$ and $\alpha_2$. For example, let $T: R^3 \to R^3$ be defined as 

$$T(x_1, x_2, x_3) = (x_1 - x_2, \; x_1 + x_3, \; x_3).$$

Then T is a linear transformation, since 

$$\begin{aligned}
T[\alpha(x_1, x_2, x_3) + \beta(y_1, y_2, y_3)]
&= T(\alpha x_1 + \beta y_1, \; \alpha x_2 + \beta y_2, \; \alpha x_3 + \beta y_3) \\
&= (\alpha x_1 + \beta y_1 - \alpha x_2 - \beta y_2, \; \alpha x_1 + \beta y_1 + \alpha x_3 + \beta y_3, \; \alpha x_3 + \beta y_3) \\
&= \alpha(x_1 - x_2, \; x_1 + x_3, \; x_3) + \beta(y_1 - y_2, \; y_1 + y_3, \; y_3) \\
&= \alpha T(x_1, x_2, x_3) + \beta T(y_1, y_2, y_3).
\end{aligned}$$

We note that the image of U under T, or the range of T, namely T(U), is 
a vector subspace of V. This is true because if $v_1, v_2$ are in T(U), then there 
exist $u_1$ and $u_2$ in U such that $v_1 = T(u_1)$ and $v_2 = T(u_2)$. Hence, $v_1 + v_2 = 
T(u_1) + T(u_2) = T(u_1 + u_2)$, which belongs to T(U). Also, if $\alpha$ is a scalar, 
then $\alpha T(u) = T(\alpha u) \in T(U)$ for any $u \in U$. 


Definition 2.2.1. Let $T: U \to V$ be a linear transformation. The kernel of 
T, denoted by ker T, is the collection of all vectors u in U such that T(u) = 0, 
where 0 is the zero vector in V. The kernel of T is also called the null space 
of T. 

As an example of a kernel, let $T: R^3 \to R^2$ be defined as $T(x_1, x_2, x_3) = 
(x_1 - x_2, \; x_1 - x_3)$. Then 

$$\ker T = \{(x_1, x_2, x_3) \mid x_1 = x_2, \; x_1 = x_3\}.$$

In this case, ker T consists of all points $(x_1, x_2, x_3)$ in $R^3$ that lie on a straight 
line through the origin given by the equations $x_1 = x_2 = x_3$. □ 


Theorem 2.2.1. Let $T: U \to V$ be a linear transformation. Then we have 
the following: 


1. ker T is a vector subspace of U. 
2. dim U = dim(ker T) + dim[T(U)]. 


Proof. Part 1 is left as an exercise. To prove part 2 we consider the 
following. Let dim U = n, dim(ker T) = p, and dim[T(U)] = q. Let 
$u_1, u_2, \ldots, u_p$ be a basis for ker T, and $v_1, v_2, \ldots, v_q$ be a basis for T(U). Then, 
there exist vectors $w_1, w_2, \ldots, w_q$ in U such that $T(w_i) = v_i$ (i = 1, 2, ..., q). We 
need to show that $u_1, u_2, \ldots, u_p; w_1, w_2, \ldots, w_q$ form a basis for U, that is, 
they are linearly independent and span U. 

Suppose that there exist scalars $\alpha_1, \alpha_2, \ldots, \alpha_p; \beta_1, \beta_2, \ldots, \beta_q$ such that 

$$\sum_{i=1}^{p} \alpha_i u_i + \sum_{i=1}^{q} \beta_i w_i = 0. \tag{2.1}$$

Then 

$$\begin{aligned}
0 &= T\left(\sum_{i=1}^{p} \alpha_i u_i + \sum_{i=1}^{q} \beta_i w_i\right) \\
&= \sum_{i=1}^{p} \alpha_i T(u_i) + \sum_{i=1}^{q} \beta_i T(w_i) \\
&= \sum_{i=1}^{q} \beta_i T(w_i), \quad \text{since } u_i \in \ker T, \; i = 1, 2, \ldots, p, \\
&= \sum_{i=1}^{q} \beta_i v_i,
\end{aligned}$$

where the 0 on the left-hand side represents the zero vector in V. 
Since the $v_i$'s are linearly independent, then $\beta_i = 0$ for i = 1, 2, ..., q. From 
(2.1) it follows that $\alpha_i = 0$ for i = 1, 2, ..., p, since the $u_i$'s are also linearly 
independent. Thus the vectors $u_1, u_2, \ldots, u_p; w_1, w_2, \ldots, w_q$ are linearly 
independent. 

Let us now suppose that u is any vector in U. To show that it belongs to 
$L(u_1, u_2, \ldots, u_p; w_1, w_2, \ldots, w_q)$, let v = T(u). Then there exist scalars 
$a_1, a_2, \ldots, a_q$ such that $v = \sum_{i=1}^{q} a_i v_i$. It follows that 

$$T(u) = \sum_{i=1}^{q} a_i T(w_i) = T\left(\sum_{i=1}^{q} a_i w_i\right).$$

Thus, 

$$T\left(u - \sum_{i=1}^{q} a_i w_i\right) = 0,$$

and $u - \sum_{i=1}^{q} a_i w_i$ must then belong to ker T. Hence, 

$$u - \sum_{i=1}^{q} a_i w_i = \sum_{i=1}^{p} b_i u_i \tag{2.2}$$

for some scalars $b_1, b_2, \ldots, b_p$. From (2.2) we then have 

$$u = \sum_{i=1}^{p} b_i u_i + \sum_{i=1}^{q} a_i w_i,$$

which shows that u belongs to the linear span of $u_1, u_2, \ldots, u_p; w_1, w_2, \ldots, w_q$. 
We conclude that these vectors form a basis for U. Hence, n = p + q. □ 
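
Part 2 of Theorem 2.2.1 can be checked numerically for the transformation $T(x_1, x_2, x_3) = (x_1 - x_2, x_1 - x_3)$ used in Definition 2.2.1. The sketch below makes two assumptions that go beyond this section: T is represented by its coefficient matrix with respect to the standard bases, and dim[T(U)] is computed as the rank of that matrix by Gaussian elimination. The names `rank` and `apply` are illustrative.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by Gaussian elimination in exact arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        # Find a row at or below position r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def apply(matrix, x):
    """Image of the vector x under the transformation represented by matrix."""
    return tuple(sum(a * xi for a, xi in zip(row, x)) for row in matrix)

# Coefficient matrix of T(x1, x2, x3) = (x1 - x2, x1 - x3).
T = [[1, -1, 0],
     [1, 0, -1]]

n = 3                     # dim U, with U = R^3
q = rank(T)               # dim T(U)
p = 1                     # dim(ker T): the line x1 = x2 = x3, as described in the text
print(apply(T, (1, 1, 1)))  # (0, 0): the vector (1, 1, 1) lies in ker T
print(n == p + q)           # True
```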


Corollary 2.2.1. $T: U \to V$ is a one-to-one linear transformation if and 
only if dim(ker T) = 0. 


Proof. If T is a one-to-one linear transformation, then ker T consists of 
just one vector, namely, the zero vector. Hence, dim(ker T) = 0. Vice versa, if 
dim(ker T) = 0, or equivalently, if ker T consists of just the zero vector, then 
T must be a one-to-one transformation. This is true because if $u_1$ and $u_2$ are 
in U and such that $T(u_1) = T(u_2)$, then $T(u_1 - u_2) = 0$, which implies that 
$u_1 - u_2 \in \ker T$ and thus $u_1 - u_2 = 0$. □ 


2.3. MATRICES AND DETERMINANTS 


Matrix algebra was devised by the English mathematician Arthur Cayley 
(1821-1895). The use of matrices originated with Cayley in connection with 
linear transformations of the form 

$$a x_1 + b x_2 = y_1,$$

$$c x_1 + d x_2 = y_2,$$

where a, b, c, and d are scalars. This transformation is completely determined 
by the square array 

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix},$$

which is called a matrix of order 2 × 2. In general, let $T: U \to V$ be a linear 
transformation, where U and V are vector spaces of dimensions m and n, 
respectively. Let $u_1, u_2, \ldots, u_m$ be a basis for U and $v_1, v_2, \ldots, v_n$ be a basis for 
V. For i = 1, 2, ..., m, consider $T(u_i)$, which can be uniquely represented as 

$$T(u_i) = \sum_{j=1}^{n} a_{ij} v_j, \quad i = 1, 2, \ldots, m,$$

where the $a_{ij}$'s are scalars. These scalars completely determine all possible 
values of T: If $u \in U$, then $u = \sum_{i=1}^{m} c_i u_i$ for some scalars $c_1, c_2, \ldots, c_m$. Then 
$T(u) = \sum_{i=1}^{m} c_i T(u_i) = \sum_{i=1}^{m} c_i \left(\sum_{j=1}^{n} a_{ij} v_j\right)$. By definition, the rectangular array 

$$A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}$$

is called a matrix of order m × n, which indicates that A has m rows and n 
columns. The $a_{ij}$'s are called the elements of A. In some cases it is more 
convenient to represent A using the notation $A = (a_{ij})$. In particular, if 
m = n, then A is called a square matrix. Furthermore, if the off-diagonal 
elements of a square matrix A are zero, then A is called a diagonal matrix and 
is written as $A = \mathrm{Diag}(a_{11}, a_{22}, \ldots, a_{nn})$. In this special case, if the diagonal 
elements are equal to 1, then A is called the identity matrix and is denoted by 
$I_n$ to indicate that it is of order n × n. A matrix of order m × 1 is called a 
column vector. Likewise, a matrix of order 1 × n is called a row vector. 


2.3.1. Basic Operations on Matrices 


1. Equality of Matrices. Let A= (a;;) and B = (b;;) be two matrices of the 
same order. Then A =B if and only if a;,;=b;; for all i= 1,2,...,m; 
j=1,2,...,7. 
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2. Addition of Matrices. Let A = (a;;) and B = (b,;) be two matrices of 
order m Xn. Then A + B is a matrix C = (c;;) of order m Xn such that 
cj=a; +b; (i=1,2,...,m; j=1,2,...,n). 

3. Scalar Multiplication. Let a be a scalar, and A = (a;;) be a matrix of 
order m Xn. Then aA = (aa;;). 

4. The Transpose of a Matrix. Let A=(a;;) be a matrix of order m Xn. 
The transpose of A, denoted by A’, is a matrix of order n Xm whose 
rows are the columns of A. For example, 


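$$\text{if } A = \begin{bmatrix} 2 & 3 & 1 \\ -1 & 0 & 7 \end{bmatrix}, \text{ then } A' = \begin{bmatrix} 2 & -1 \\ 3 & 0 \\ 1 & 7 \end{bmatrix}.$$

A matrix A is symmetric if A = A'. It is skew-symmetric if A' = -A. 
A skew-symmetric matrix must necessarily have zero elements along its 
diagonal. 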

5. Product of Matrices. Let $A = (a_{ij})$ and $B = (b_{ij})$ be matrices of orders 
m × n and n × p, respectively. The product AB is a matrix $C = (c_{ij})$ of 
order m × p such that $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$ (i = 1, 2, ..., m; j = 1, 2, ..., p). 
It is to be noted that this product is defined only when the number of 
columns of A is equal to the number of rows of B. 

In particular, if a and b are column vectors of order n × 1, then their 
dot product a·b can be expressed as a matrix product of the form a'b 
or b'a. 

6. The Trace of a Matrix. Let $A = (a_{ij})$ be a square matrix of order n × n. 
The trace of A, denoted by tr(A), is the sum of its diagonal elements, 
that is, 

$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}.$$

On the basis of this definition, it is easy to show that if A and B are 
matrices of order n × n, then the following hold: (i) tr(AB) = tr(BA); 
(ii) tr(A + B) = tr(A) + tr(B). 
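
The basic operations listed above can be checked numerically. The following is a minimal Python sketch, not part of the original text, in which a matrix is stored as a list of rows. The helper names `transpose`, `matmul`, and `trace`, as well as the two small matrices, are illustrative assumptions.

```python
def transpose(a):
    """Return A', whose rows are the columns of A."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Return AB; defined only when columns of A equal rows of B."""
    if len(a[0]) != len(b):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def trace(a):
    """Return the sum of the diagonal elements of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))


# Two arbitrary 2 x 2 matrices chosen only for illustration.
A = [[1, 2], [3, 4]]
B = [[0, 5], [6, 7]]
A_plus_B = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

print(trace(matmul(A, B)), trace(matmul(B, A)))  # both equal 55
print(trace(A_plus_B), trace(A) + trace(B))      # both equal 12

# Dot product of two column vectors written as the matrix product a'b.
a = [[1], [2], [3]]
b = [[4], [5], [6]]
print(matmul(transpose(a), b))                   # [[32]]
```

Note that AB and BA are different matrices here, yet their traces agree, as property (i) asserts.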


Definition 2.3.1. Let $A = (a_{ij})$ be an m × n matrix. A submatrix B of A is 
a matrix which can be obtained from A by deleting a certain number of rows 
and columns. 

In particular, if the ith row and jth column of A that contain the element 
$a_{ij}$ are deleted, then the resulting matrix is denoted by $M_{ij}$ (i = 1, 2, ..., m; 
j = 1, 2, ..., n). 

Let us now suppose that A is a square matrix of order n × n. If rows 
$i_1, i_2, \ldots, i_p$ and columns $i_1, i_2, \ldots, i_p$ are deleted from A, where p < n, then 
the resulting submatrix is called a principal submatrix of A. In particular, if 
the deleted rows and columns are the last p rows and the last p columns, 
respectively, then such a submatrix is called a leading principal submatrix. 


Definition 2.3.2. A partitioned matrix is a matrix that consists of several 
submatrices obtained by drawing horizontal and vertical lines that separate it 
into groups of rows and columns. 

For example, the matrix 


-0 
y) 


a2 


1 3 
A=|6 10 - 
3 ie 


(a A O i a 


is partitioned into six submatrices by drawing one horizontal line and two 
vertical lines as shown above. 


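Definition 2.3.3. Let $A = (a_{ij})$ be an $m_1 \times n_1$ matrix and B be an $m_2 \times n_2$ 
matrix. The direct (or Kronecker) product of A and B, denoted by A ⊗ B, is a 
matrix of order $m_1 m_2 \times n_1 n_2$ defined as a partitioned matrix of the form 

$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1n_1}B \\ a_{21}B & a_{22}B & \cdots & a_{2n_1}B \\ \vdots & \vdots & & \vdots \\ a_{m_1 1}B & a_{m_1 2}B & \cdots & a_{m_1 n_1}B \end{bmatrix}.$$

This matrix can be simplified by writing $A \otimes B = [a_{ij}B]$. □ 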


Properties of the direct product can be found in several matrix algebra 
books and papers. See, for example, Graybill (1983, Section 8.8), Henderson 
and Searle (1981), Magnus and Neudecker (1988, Chapter 2), and Searle 
(1982, Section 10.7). Some of these properties are listed below: 


1. (A ⊗ B)' = A' ⊗ B'. 

2. A ⊗ (B ⊗ C) = (A ⊗ B) ⊗ C. 

3. (A ⊗ B)(C ⊗ D) = AC ⊗ BD, if AC and BD are defined. 

4. tr(A ⊗ B) = tr(A)tr(B), if A and B are square matrices. 


The paper by Henderson, Pukelsheim, and Searle (1983) gives a detailed 
account of the history associated with direct products. 
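
The direct product and the four properties listed above can be illustrated with a short computation. The following Python sketch is not from the original text; the function name `kron` and the matrices A, B, C, D are assumptions chosen for illustration, with matrices stored as lists of rows.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def trace(a):
    return sum(a[i][i] for i in range(len(a)))


def kron(a, b):
    """Return the direct product [a_ij B] as a list of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


A = [[1, 2], [3, 4]]
B = [[0, 1], [2, 3]]
C = [[1, 0], [1, 1]]
D = [[2, 1], [0, 1]]

K = kron(A, B)
for row in K:
    print(row)

# Property 1: (A x B)' = A' x B'
print(transpose(K) == kron(transpose(A), transpose(B)))
# Property 2: A x (B x C) = (A x B) x C
print(kron(A, kron(B, C)) == kron(kron(A, B), C))
# Property 3: (A x B)(C x D) = AC x BD
print(matmul(kron(A, B), kron(C, D)) == kron(matmul(A, C), matmul(B, D)))
# Property 4: tr(A x B) = tr(A) tr(B)
print(trace(K) == trace(A) * trace(B))
```

Each of the four comparisons prints `True` for these matrices; a numerical check of this kind is of course not a proof of the general properties.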


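Definition 2.3.4. Let $A_1, A_2, \ldots, A_k$ be matrices of orders $m_i \times n_i$ (i = 
1, 2, ..., k). The direct sum of these matrices, denoted by $\bigoplus_{i=1}^{k} A_i$, is a 
partitioned matrix of order $(\sum_{i=1}^{k} m_i) \times (\sum_{i=1}^{k} n_i)$ that has the block-diagonal 
form 

$$\bigoplus_{i=1}^{k} A_i = \mathrm{Diag}(A_1, A_2, \ldots, A_k).$$

The following properties can be easily shown on the basis of the preceding 
definition: 


1. $\bigoplus_{i=1}^{k} A_i + \bigoplus_{i=1}^{k} B_i = \bigoplus_{i=1}^{k} (A_i + B_i)$, if $A_i$ and $B_i$ are of the same order 
for i = 1, 2, ..., k. 

2. $[\bigoplus_{i=1}^{k} A_i][\bigoplus_{i=1}^{k} B_i] = \bigoplus_{i=1}^{k} A_i B_i$, if $A_i B_i$ is defined for i = 1, 2, ..., k. 

3. $[\bigoplus_{i=1}^{k} A_i]' = \bigoplus_{i=1}^{k} A_i'$. 

4. $\mathrm{tr}(\bigoplus_{i=1}^{k} A_i) = \sum_{i=1}^{k} \mathrm{tr}(A_i)$. □ 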


Definition 2.3.5. Let $A = (a_{ij})$ be a square matrix of order n × n. The 
determinant of A, denoted by det(A), is a scalar quantity that can be 
computed iteratively as 

$$\det(A) = \sum_{j=1}^{n} (-1)^{j+1} a_{1j} \det(M_{1j}), \tag{2.3}$$

where $M_{1j}$ is a submatrix of A obtained by deleting row 1 and column j 
(j = 1, 2, ..., n). For each j, the determinant of $M_{1j}$ is obtained in terms of 
determinants of matrices of order (n - 2) × (n - 2) using a formula similar to 
(2.3). This process is repeated several times until the matrices on the 
right-hand side of (2.3) become of order 2 × 2. The determinant of a 2 × 2 
matrix such as $B = (b_{ij})$ is given by $\det(B) = b_{11}b_{22} - b_{12}b_{21}$. Thus by an 
iterative application of formula (2.3), the value of det(A) can be fully 
determined. For example, let A be the matrix 

$$A = \begin{bmatrix} 1 & 2 & -1 \\ 5 & 0 & 3 \\ 1 & 2 & 1 \end{bmatrix}.$$

Then $\det(A) = \det(A_1) - 2\det(A_2) - \det(A_3)$, where $A_1, A_2, A_3$ are 2 × 2 
submatrices, namely 

$$A_1 = \begin{bmatrix} 0 & 3 \\ 2 & 1 \end{bmatrix}, \qquad A_2 = \begin{bmatrix} 5 & 3 \\ 1 & 1 \end{bmatrix}, \qquad A_3 = \begin{bmatrix} 5 & 0 \\ 1 & 2 \end{bmatrix}.$$

It follows that det(A) = -6 - 2(2) - 10 = -20. □ 
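
The iterative expansion in (2.3) translates directly into a recursive procedure. The following Python sketch, which is not part of the original text, applies it to the 3 × 3 example above; the function names `minor` and `det` are illustrative, and indices are 0-based, so the sign $(-1)^{j+1}$ of (2.3) becomes `(-1) ** j`.

```python
def minor(a, i, j):
    """Submatrix obtained by deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]


def det(a):
    """Determinant by expansion along the first row, as in (2.3)."""
    n = len(a)
    if n == 1:
        return a[0][0]
    if n == 2:
        return a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(n))


A = [[1, 2, -1],
     [5, 0, 3],
     [1, 2, 1]]

# The three 2 x 2 submatrices A_1, A_2, A_3 of the worked example.
print([det(minor(A, 0, j)) for j in range(3)])  # [-6, 2, 10]
print(det(A))                                   # -20
```

The cost of this expansion grows like n!, so it is useful only for small matrices or for illustrating the definition.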


Definition 2.3.6. Let $A = (a_{ij})$ be a square matrix of order n × n. The 
determinant of $M_{ij}$, the submatrix obtained by deleting row i and column j, 
is called a minor of A of order n - 1. The quantity $(-1)^{i+j}\det(M_{ij})$ is called 
a cofactor of the corresponding (i, j)th element of A. More generally, if A is 
an m × n matrix and if we strike out all but p rows and the same number of 
columns from A, where p ≤ min(m, n), then the determinant of the resulting 
submatrix is called a minor of A of order p. 

The determinant of a principal submatrix of a square matrix A is called a 
principal minor. If, however, we have a leading principal submatrix, then its 
determinant is called a leading principal minor. □ 


NOTE 2.3.1. The determinant of a matrix A is defined only when A is a 
square matrix. 


NOTE 2.3.2. The expansion of det(A) in (2.3) was carried out by multiplying 
the elements of the first row of A by their corresponding cofactors and 
then summing over j (= 1, 2, ..., n). The same value of det(A) could have also 
been obtained by similar expansions according to the elements of any row of 
A (instead of the first row), or any column of A. Thus if $M_{ij}$ is a submatrix of 
A obtained by deleting row i and column j, then det(A) can be obtained 
by using any of the following expansions: 

$$\text{By row } i: \quad \det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det(M_{ij}), \qquad i = 1, 2, \ldots, n.$$

$$\text{By column } j: \quad \det(A) = \sum_{i=1}^{n} (-1)^{i+j} a_{ij} \det(M_{ij}), \qquad j = 1, 2, \ldots, n.$$


NOTE 2.3.3. Some of the properties of determinants are the following: 


i. det(AB) = det(A)det(B), if A and B are n × n matrices. 
ii. If A' is the transpose of A, then det(A') = det(A). 
iii. If A is an n × n matrix and α is a scalar, then $\det(\alpha A) = \alpha^n \det(A)$. 
iv. If any two rows (or columns) of A are identical, then det(A) = 0. 
v. If any two rows (or columns) of A are interchanged, then det(A) is 
multiplied by -1. 
vi. If det(A) = 0, then A is called a singular matrix. Otherwise, A is a 
nonsingular matrix. 
vii. If A and B are matrices of orders m × m and n × n, respectively, then 
the following hold: (a) $\det(A \otimes B) = [\det(A)]^n [\det(B)]^m$; (b) $\det(A \oplus B) 
= [\det(A)][\det(B)]$. 


NOTE 2.3.4. The history of determinants dates back to the fourteenth 
century. According to Smith (1958, page 273), the Chinese had some knowl- 
edge of determinants as early as about 1300 A.D. Smith (1958, page 440) also 
reported that the Japanese mathematician Seki Kowa (1642-1708) had 
discovered the expansion of a determinant in solving simultaneous equations. 
In the West, the theory of determinants is believed to have originated with 
the German mathematician Gottfried Leibniz (1646-1716) in 1693, ten years 
after the work of Seki Kowa. However, the actual development of the theory 
of determinants did not begin until the publication of a book by Gabriel 
Cramer (1704-1752) (see Price, 1947, page 85) in 1750. Other mathemati- 
cians who contributed to this theory include Alexandre Vandermonde 
(1735-1796), Pierre-Simon Laplace (1749-1827), Carl Gauss (1777-1855), 
and Augustin-Louis Cauchy (1789-1857). Arthur Cayley (1821-1895) is cred- 
ited with having been the first to introduce the common present-day notation 
of vertical bars enclosing a square matrix. For more interesting facts about 
the history of determinants, the reader is advised to read the article by Price 
(1947). 


2.3.2. The Rank of a Matrix 


Let $A = (a_{ij})$ be a matrix of order m × n. Let $u_1', u_2', \ldots, u_m'$ denote the row 
vectors of A, and let $v_1, v_2, \ldots, v_n$ denote its column vectors. Consider the 
linear spans of the row and column vectors, namely, $V_1 = L(u_1', u_2', \ldots, u_m')$ and 
$V_2 = L(v_1, v_2, \ldots, v_n)$, respectively. 


Theorem 2.3.1. The vector spaces V, and V, have the same dimension. 


Proof. See Lancaster (1969, Theorem 1.15.1), or Searle (1982, Section 6.6). □ 


Thus, for any matrix A, the number of linearly independent rows is the 
same as the number of linearly independent columns. 


Definition 2.3.7. The rank of a matrix A is the number of its linearly 
independent rows (or columns). The rank of A is denoted by r(A). □ 


Theorem 2.3.2. If a matrix A has a nonzero minor of order r, and if all 
minors of order r + 1 and higher (if they exist) are zero, then A has rank r. 


Proof. See Lancaster (1969, Lemma 1, Section 1.15). □ 


For example, if A is the matrix 

$$A = \begin{bmatrix} 2 & 3 & -1 \\ 0 & 1 & 2 \\ 2 & 4 & 1 \end{bmatrix},$$

then r(A) = 2. This is because det(A) = 0 and at least one minor of order 2 is 
different from zero. 

There are several properties associated with the rank of a matrix. Some of 
these properties are the following: 


1. r(A) = r(A'). 

2. The rank of A is unchanged if A is multiplied by a nonsingular matrix. 
Thus if A is an m × n matrix and P is an n × n nonsingular matrix, then 
r(A) = r(AP). 

3. r(A) = r(AA') = r(A'A). 

4. If the matrix A is partitioned as $A = [A_1 : A_2]$, where $A_1$ and $A_2$ are 
submatrices of the same order, then $r(A_1 + A_2) \le r(A) \le r(A_1) + r(A_2)$. 
More generally, if the matrices $A_1, A_2, \ldots, A_k$ are of the same order 
and if A is partitioned as $A = [A_1 : A_2 : \cdots : A_k]$, then 

$$r\Bigl(\sum_{i=1}^{k} A_i\Bigr) \le r(A) \le \sum_{i=1}^{k} r(A_i).$$

5. If the product AB is defined, then r(A) + r(B) - n ≤ r(AB) ≤ 
min{r(A), r(B)}, where n is the number of columns of A (or the number 
of rows of B). 

6. r(A ⊗ B) = r(A)r(B). 

7. r(A ⊕ B) = r(A) + r(B). 
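
The rank of the 3 × 3 example above, together with properties 1 and 3, can be verified by row reduction. The following Python sketch is not part of the original text. It computes the rank by Gaussian elimination in exact rational arithmetic, which is a different (but equivalent) route from the minor-based criterion of Theorem 2.3.2; the function name `rank` is an illustrative assumption.

```python
from fractions import Fraction


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def rank(a):
    """Number of pivots found by Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in a]
    rows, cols = len(m), len(m[0])
    r = 0  # index of the next pivot row
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            factor = m[i][c] / m[r][c]
            m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
        if r == rows:
            break
    return r


A = [[2, 3, -1],
     [0, 1, 2],
     [2, 4, 1]]

print(rank(A))                        # 2: the third row is the sum of the first two
print(rank(transpose(A)))             # 2, property 1
print(rank(matmul(A, transpose(A))))  # 2, property 3
```

Exact fractions are used so that a zero pivot is detected reliably; with floating-point entries a tolerance would be needed instead.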


Definition 2.3.8. Let A be a matrix of order m × n and rank r. Then we 
have the following: 


1. A is said to have a full row rank if r = m < n. 

2. A is said to have a full column rank if r = n < m. 

3. A is of full rank if r = m = n. In this case, det(A) ≠ 0, that is, A is a 
nonsingular matrix. □ 


2.3.3. The Inverse of a Matrix 


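Let $A = (a_{ij})$ be a nonsingular matrix of order n × n. The inverse of A, 
denoted by $A^{-1}$, is an n × n matrix that satisfies the condition $AA^{-1} = A^{-1}A 
= I_n$. 

The inverse of A can be computed as follows: Let $c_{ij}$ be the cofactor of $a_{ij}$ 
(see Definition 2.3.6). Define the matrix C as $C = (c_{ij})$. The transpose of C is 
called the adjugate or adjoint of A and is denoted by adj A. The inverse of A 
is then given by 

$$A^{-1} = \frac{\mathrm{adj}\, A}{\det(A)}.$$

It can be verified that 

$$A\left[\frac{\mathrm{adj}\, A}{\det(A)}\right] = \left[\frac{\mathrm{adj}\, A}{\det(A)}\right] A = I_n.$$

For example, if A is the matrix 

$$A = \begin{bmatrix} 2 & 0 & 1 \\ -3 & 2 & 0 \\ 2 & 1 & 1 \end{bmatrix},$$

then det(A) = -3, and 

$$\mathrm{adj}\, A = \begin{bmatrix} 2 & 1 & -2 \\ 3 & 0 & -3 \\ -7 & -2 & 4 \end{bmatrix}.$$

Hence, 

$$A^{-1} = \begin{bmatrix} -\tfrac{2}{3} & -\tfrac{1}{3} & \tfrac{2}{3} \\ -1 & 0 & 1 \\ \tfrac{7}{3} & \tfrac{2}{3} & -\tfrac{4}{3} \end{bmatrix}.$$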
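
The adjugate formula for the inverse can be followed step by step in code. The sketch below is not part of the original text: it builds the cofactor matrix of the 3 × 3 example, transposes it, and divides by the determinant using exact fractions. The function names are illustrative assumptions.

```python
from fractions import Fraction


def minor(a, i, j):
    """Submatrix obtained by deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(a) if r != i]


def det(a):
    """Determinant by cofactor expansion along the first row."""
    if len(a) == 0:
        return 1  # convention for the empty matrix
    if len(a) == 1:
        return a[0][0]
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j))
               for j in range(len(a)))


def adjugate(a):
    """Transpose of the matrix of cofactors c_ij = (-1)^(i+j) det(M_ij)."""
    n = len(a)
    cof = [[(-1) ** (i + j) * det(minor(a, i, j)) for j in range(n)]
           for i in range(n)]
    return [list(col) for col in zip(*cof)]


def inverse(a):
    d = det(a)
    if d == 0:
        raise ValueError("matrix is singular")
    return [[Fraction(x, d) for x in row] for row in adjugate(a)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


A = [[2, 0, 1],
     [-3, 2, 0],
     [2, 1, 1]]

print(det(A))        # -3
print(adjugate(A))   # [[2, 1, -2], [3, 0, -3], [-7, -2, 4]]
A_inv = inverse(A)
print(matmul(A, A_inv) == [[1, 0, 0], [0, 1, 0], [0, 0, 1]])  # True
```

In numerical practice the inverse is rarely formed this way; the adjugate route is shown only because it mirrors the definition in the text.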


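Some properties of the inverse operation are given below: 


1. $(AB)^{-1} = B^{-1}A^{-1}$. 

2. $(A')^{-1} = (A^{-1})'$. 

3. $\det(A^{-1}) = 1/\det(A)$. 

4. $(A^{-1})^{-1} = A$. 

5. $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$. 

6. $(A \oplus B)^{-1} = A^{-1} \oplus B^{-1}$. 

7. If A is partitioned as 

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},$$

where $A_{ij}$ is of order $n_i \times n_j$ (i, j = 1, 2), then 

$$\det(A) = \begin{cases} \det(A_{11}) \cdot \det(A_{22} - A_{21}A_{11}^{-1}A_{12}) & \text{if } A_{11} \text{ is nonsingular}, \\ \det(A_{22}) \cdot \det(A_{11} - A_{12}A_{22}^{-1}A_{21}) & \text{if } A_{22} \text{ is nonsingular}. \end{cases}$$

The inverse of A is partitioned as 

$$A^{-1} = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix},$$

where 

$$B_{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1},$$
$$B_{12} = -B_{11}A_{12}A_{22}^{-1},$$
$$B_{21} = -A_{22}^{-1}A_{21}B_{11},$$
$$B_{22} = A_{22}^{-1} + A_{22}^{-1}A_{21}B_{11}A_{12}A_{22}^{-1}.$$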

2.3.4. Generalized Inverse of a Matrix 


This inverse represents a more general concept than the one discussed in the 
previous section. Let A be a matrix of order m × n. Then, a generalized 
inverse of A, denoted by $A^{-}$, is a matrix of order n × m that satisfies the 
condition 

$$AA^{-}A = A. \tag{2.4}$$

Note that $A^{-}$ is defined even if A is not a square matrix. If A is a square 
matrix, it does not have to be nonsingular. Furthermore, condition (2.4) can 
be satisfied by infinitely many matrices (see, for example, Searle, 1982, 
Chapter 8). If A is nonsingular, then (2.4) is satisfied by only $A^{-1}$. Thus $A^{-1}$ 
is a special case of $A^{-}$. 
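
To see that condition (2.4) does not determine a generalized inverse uniquely, consider a small singular matrix. The following Python sketch is not part of the original text; the matrix A and the two candidate matrices G1 and G2 are assumptions chosen for illustration, and both are checked against $AA^{-}A = A$.

```python
from fractions import Fraction


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def is_generalized_inverse(a, g):
    """Check condition (2.4): A G A = A."""
    return matmul(matmul(a, g), a) == a


# A is singular: its second row is twice its first row.
A = [[1, 2],
     [2, 4]]

# Two different matrices, both satisfying (2.4).
G1 = [[1, 0],
      [0, 0]]
G2 = [[0, 0],
      [0, Fraction(1, 4)]]

print(is_generalized_inverse(A, G1))  # True
print(is_generalized_inverse(A, G2))  # True
print(G1 == G2)                       # False
```

Any convex combination of G1 and G2 also satisfies (2.4) for this A, which is consistent with the remark that infinitely many generalized inverses exist.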


Theorem 2.3.3. 


1. If A is a symmetric matrix, then $A^{-}$ can be chosen to be symmetric. 
2. $A(A'A)^{-}A'A = A$ for any matrix A. 
3. $A(A'A)^{-}A'$ is invariant to the choice of a generalized inverse of A'A. 


Proof. See Searle (1982, pages 221-222). □ 


2.3.5. Eigenvalues and Eigenvectors of a Matrix 


Let A be a square matrix of order n × n. By definition, a scalar λ is said to be 
an eigenvalue (or characteristic root) of A if $A - \lambda I_n$ is a singular matrix, that 
is, 

$$\det(A - \lambda I_n) = 0. \tag{2.5}$$

Thus an eigenvalue of A satisfies a polynomial equation of degree n called 
the characteristic equation of A. If λ is a multiple solution (or root) of 
equation (2.5), that is, (2.5) has several roots, say m, that are equal to λ, then 
λ is said to be an eigenvalue of multiplicity m. 

Since $r(A - \lambda I_n) < n$ by the fact that $A - \lambda I_n$ is singular, the columns of 
$A - \lambda I_n$ must be linearly related. Hence, there exists a nonzero vector v such 
that 

$$(A - \lambda I_n)v = 0, \tag{2.6}$$

or equivalently, 

$$Av = \lambda v. \tag{2.7}$$

A vector satisfying (2.7) is called an eigenvector (or a characteristic vector) 
corresponding to the eigenvalue λ. From (2.7) we note that the linear 
transformation of v by the matrix A is a scalar multiple of v. 

The following theorems describe certain properties associated with eigen- 
values and eigenvectors. The proofs of these theorems can be found in 
standard matrix algebra books (see the annotated bibliography). 


Theorem 2.3.4. A square matrix A is singular if and only if at least one of 
its eigenvalues is equal to zero. In particular, if A is symmetric, then its rank 
is equal to the number of its nonzero eigenvalues. 


Theorem 2.3.5. The eigenvalues of a symmetric matrix are real. 


Theorem 2.3.6. Let A be a square matrix, and let $\lambda_1, \lambda_2, \ldots, \lambda_k$ denote its 
distinct eigenvalues. If $v_1, v_2, \ldots, v_k$ are eigenvectors of A corresponding 
to $\lambda_1, \lambda_2, \ldots, \lambda_k$, respectively, then $v_1, v_2, \ldots, v_k$ are linearly independent. In 
particular, if A is symmetric, then $v_1, v_2, \ldots, v_k$ are orthogonal to one another, 
that is, $v_i'v_j = 0$ for i ≠ j (i, j = 1, 2, ..., k). 


Theorem 2.3.7. Let A and B be two matrices of orders m × m and n × n, 
respectively. Let $\lambda_1, \lambda_2, \ldots, \lambda_m$ be the eigenvalues of A, and $\nu_1, \nu_2, \ldots, \nu_n$ be 
the eigenvalues of B. Then we have the following: 


1. The eigenvalues of A ⊗ B are of the form $\lambda_i \nu_j$ (i = 1, 2, ..., m; j = 
1, 2, ..., n). 

2. The eigenvalues of A ⊕ B are $\lambda_1, \lambda_2, \ldots, \lambda_m; \nu_1, \nu_2, \ldots, \nu_n$. 


Theorem 2.3.8. Let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the eigenvalues of a matrix A of 
order n × n. Then the following hold: 


1. $\mathrm{tr}(A) = \sum_{i=1}^{n} \lambda_i$. 
2. $\det(A) = \prod_{i=1}^{n} \lambda_i$. 


Theorem 2.3.9. Let A and B be two matrices of orders m × n and n × m 
(n ≥ m), respectively. The nonzero eigenvalues of BA are the same as those 
of AB. 


2.3.6. Some Special Matrices 


1. The vector $1_n$ is a column vector of ones of order n × 1. 

2. The matrix $J_n$ is a matrix of ones of order n × n. 

3. Idempotent Matrix. A square matrix A for which $A^2 = A$ is called an 
idempotent matrix. For example, the matrix $A = I_n - (1/n)J_n$ is 
idempotent of order n × n. The eigenvalues of an idempotent matrix are 
equal to zeros and ones. It follows from Theorem 2.3.8 that the rank of 
an idempotent matrix, which is the same as the number of eigenvalues 
that are equal to 1, is also equal to its trace. Idempotent matrices are 
used in many applications in statistics (see Section 2.4). 

4. Orthogonal Matrix. A square matrix A is orthogonal if A'A = I. From 
this definition it follows that (i) A is orthogonal if and only if $A' = A^{-1}$; 
(ii) |det(A)| = 1. A special orthogonal matrix is the Householder matrix, 
which is a symmetric matrix of the form 

$$H = I - 2uu'/u'u,$$

where u is a nonzero vector. Orthogonal matrices occur in many 
applications of matrix algebra and play an important role in statistics, 
as will be seen in Section 2.4. 
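
The two examples in items 3 and 4 can be checked with exact arithmetic. The following Python sketch is not part of the original text; the order n = 4 and the vector u = (1, 2, 2)' are arbitrary choices made for illustration.

```python
from fractions import Fraction


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def eye(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


# Idempotent matrix A = I_n - (1/n) J_n for n = 4.
n = 4
A = [[Fraction(int(i == j)) - Fraction(1, n) for j in range(n)]
     for i in range(n)]
print(matmul(A, A) == A)               # True: A^2 = A
print(sum(A[i][i] for i in range(n)))  # 3: the trace, which equals the rank

# Householder matrix H = I - 2uu'/u'u for u = (1, 2, 2)'.
u = [1, 2, 2]
uu = sum(x * x for x in u)  # u'u = 9
H = [[Fraction(int(i == j)) - Fraction(2 * u[i] * u[j], uu)
      for j in range(len(u))]
     for i in range(len(u))]
print(H == transpose(H))                # True: H is symmetric
print(matmul(transpose(H), H) == eye(3))  # True: H'H = I
```

Since H is both symmetric and orthogonal, it is its own inverse, which the last comparison also confirms.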


2.3.7. The Diagonalization of a Matrix 


Theorem 2.3.10 (The Spectral Decomposition Theorem). Let A be a 
symmetric matrix of order n × n. There exists an orthogonal matrix P such 
that $A = P\Lambda P'$, where $\Lambda = \mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ is a diagonal matrix whose 
diagonal elements are the eigenvalues of A. The columns of P are the 
corresponding orthonormal eigenvectors of A. 


Proof. See Basilevsky (1983, Theorem 5.8, page 200). □ 


If P is partitioned as $P = [p_1 : p_2 : \cdots : p_n]$, where $p_i$ is an eigenvector of A 
with eigenvalue $\lambda_i$ (i = 1, 2, ..., n), then A can be written as 

$$A = \sum_{i=1}^{n} \lambda_i p_i p_i'.$$

For example, if 

$$A = \begin{bmatrix} 1 & 0 & -2 \\ 0 & 0 & 0 \\ -2 & 0 & 4 \end{bmatrix},$$

then A has two distinct eigenvalues, $\lambda_1 = 0$ of multiplicity 2 and $\lambda_2 = 5$. For 
$\lambda_1 = 0$ we have two orthonormal eigenvectors, $p_1 = (2, 0, 1)'/\sqrt{5}$ and $p_2 = 
(0, 1, 0)'$. Note that $p_1$ and $p_2$ span the kernel (null space) of the linear 
transformation represented by A. For $\lambda_2 = 5$ we have the normal eigenvector 
$p_3 = (1, 0, -2)'/\sqrt{5}$, which is orthogonal to both $p_1$ and $p_2$. Hence, P and Λ 
in Theorem 2.3.10 for the matrix A are 

$$P = \begin{bmatrix} \frac{2}{\sqrt{5}} & 0 & \frac{1}{\sqrt{5}} \\ 0 & 1 & 0 \\ \frac{1}{\sqrt{5}} & 0 & \frac{-2}{\sqrt{5}} \end{bmatrix},$$

$$\Lambda = \mathrm{Diag}(0, 0, 5).$$
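
The example can be checked by rebuilding the matrix from its eigenvalues and eigenvectors. The following Python sketch is not part of the original text. It assumes the eigenpairs quoted in the example and, to stay in exact arithmetic, works with the unnormalized vectors $w_i$, so that each term $\lambda_i p_i p_i'$ is computed as $\lambda_i w_i w_i'/(w_i'w_i)$.

```python
from fractions import Fraction

# Eigenvalues paired with unnormalized eigenvectors from the example:
# p_1 = (2,0,1)'/sqrt(5), p_2 = (0,1,0)', p_3 = (1,0,-2)'/sqrt(5).
eigenpairs = [(0, [2, 0, 1]), (0, [0, 1, 0]), (5, [1, 0, -2])]
n = 3


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


# The eigenvectors are mutually orthogonal.
vectors = [w for _, w in eigenpairs]
print(all(dot(vectors[i], vectors[j]) == 0
          for i in range(n) for j in range(i + 1, n)))  # True

# A = sum_i lambda_i p_i p_i', with p_i p_i' = w_i w_i' / (w_i' w_i).
A = [[sum(Fraction(lam * w[i] * w[j], dot(w, w)) for lam, w in eigenpairs)
      for j in range(n)]
     for i in range(n)]
print(A == [[1, 0, -2], [0, 0, 0], [-2, 0, 4]])  # True

# Each pair satisfies A w = lambda w, as in (2.7).
print(all([dot(row, w) for row in A] == [lam * x for x in w]
          for lam, w in eigenpairs))  # True
```

Only the eigenvalue 5 contributes to the sum, so A equals $5 p_3 p_3'$, a matrix of rank 1, in agreement with Theorem 2.3.4.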


The next theorem gives a more general form of the spectral decomposition 
theorem. 


Theorem 2.3.11 (The Singular-Value Decomposition Theorem). Let A be 
a matrix of order m × n (m ≤ n) and rank r. There exist orthogonal matrices 
P and Q such that A = P[D : 0]Q', where $D = \mathrm{Diag}(\lambda_1, \lambda_2, \ldots, \lambda_m)$ is a 
diagonal matrix with nonnegative diagonal elements called the singular values of 
A, and 0 is a zero matrix of order m × (n - m). The diagonal elements of D 
are the square roots of the eigenvalues of AA'. 


Proof. See, for example, Searle (1982, pages 316-317). □ 


2.3.8. Quadratic Forms 


Let $A = (a_{ij})$ be a symmetric matrix of order n × n, and let $x = (x_1, x_2, \ldots, x_n)'$ 
be a column vector of order n × 1. The function 

$$q(x) = x'Ax = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j$$

is called a quadratic form in x. 

A quadratic form x'Ax is said to be the following: 


1. Positive definite if x'Ax > 0 for all x ≠ 0 and is zero only if x = 0. 

2. Positive semidefinite if x'Ax ≥ 0 for all x and x'Ax = 0 for at least one 
nonzero value of x. 

3. Nonnegative definite if A is either positive definite or positive 
semidefinite. 


Theorem 2.3.12. Let $A = (a_{ij})$ be a symmetric matrix of order n × n. 
Then A is positive definite if and only if either of the following two conditions 
is satisfied: 


1. The eigenvalues of A are all positive. 
2. The leading principal minors of A are all positive, that is, 

$$a_{11} > 0, \qquad \det\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} > 0, \qquad \ldots, \qquad \det(A) > 0.$$


Proof. The proof of part 1 follows directly from the spectral decomposition 
theorem. For the proof of part 2, see Lancaster (1969, Theorem 2.14.4). □ 


Theorem 2.3.13. Let $A = (a_{ij})$ be a symmetric matrix of order n × n. 
Then A is positive semidefinite if and only if its eigenvalues are nonnegative 
with at least one of them equal to zero. 


Proof. See Basilevsky (1983, Theorem 5.10, page 203). □ 


2.3.9. The Simultaneous Diagonalization of Matrices 


By simultaneous diagonalization we mean finding a matrix, say Q, that can 
reduce several square matrices to a diagonal form. In many situations there 
may be a need to diagonalize several matrices simultaneously. This occurs 
frequently in statistics, particularly in analysis of variance. 

The proofs of the following theorems can be found in Graybill (1983, 
Chapter 12). 


Theorem 2.3.14. Let A and B be symmetric matrices of order n × n. 


1. If A is positive definite, then there exists a nonsingular matrix Q such 
that $Q'AQ = I_n$ and $Q'BQ = D$, where D is a diagonal matrix whose 
diagonal elements are the roots of the polynomial equation $\det(B - \lambda A) 
= 0$. 

2. If A and B are positive semidefinite, then there exists a nonsingular 
matrix Q such that 

$$Q'AQ = D_1,$$
$$Q'BQ = D_2,$$

where $D_1$ and $D_2$ are diagonal matrices (for a detailed proof of this 
result, see Newcomb, 1960). 


Theorem 2.3.15. Let $A_1, A_2, \ldots, A_k$ be symmetric matrices of order n × n. 
Then there exists an orthogonal matrix P such that 

$$A_i = P\Lambda_i P', \qquad i = 1, 2, \ldots, k,$$

where $\Lambda_i$ is a diagonal matrix, if and only if $A_iA_j = A_jA_i$ for all i ≠ j 
(i, j = 1, 2, ..., k). 


2.3.10. Bounds on Eigenvalues 


Let A be a symmetric matrix of order n × n. We denote the ith eigenvalue of 
A by $e_i(A)$, i = 1, 2, ..., n. The smallest and largest eigenvalues of A are 
denoted by $e_{\min}(A)$ and $e_{\max}(A)$, respectively. 


Theorem 2.3.16. $e_{\min}(A) \le x'Ax/x'x \le e_{\max}(A)$. 


Proof. This follows directly from the spectral decomposition theorem. □ 


The ratio x'Ax/x'x is called Rayleigh's quotient for A. The lower and 
upper bounds in Theorem 2.3.16 can be achieved by choosing x to be an 
eigenvector associated with $e_{\min}(A)$ and $e_{\max}(A)$, respectively. Thus Theorem 
2.3.16 implies that 

$$\inf_{x \ne 0} \left[\frac{x'Ax}{x'x}\right] = e_{\min}(A), \tag{2.8}$$

$$\sup_{x \ne 0} \left[\frac{x'Ax}{x'x}\right] = e_{\max}(A). \tag{2.9}$$


Theorem 2.3.17. If A is a symmetric matrix and B is a positive definite 
matrix, both of order n × n, then 

$$e_{\min}(B^{-1}A) \le \frac{x'Ax}{x'Bx} \le e_{\max}(B^{-1}A).$$


Proof. The proof is left to the reader. □ 

Note that the above lower and upper bounds are equal to the infimum and 
supremum, respectively, of the ratio x'Ax/x'Bx for x ≠ 0. 


Theorem 2.3.18. If A is a positive semidefinite matrix and B is a positive 
definite matrix, both of order n × n, then for any i (i = 1, 2, ..., n), 

$$e_i(A)e_{\min}(B) \le e_i(AB) \le e_i(A)e_{\max}(B). \tag{2.10}$$

Furthermore, if A is positive definite, then for any i (i = 1, 2, ..., n), 

$$\frac{e_i^2(AB)}{e_{\max}(A)e_{\max}(B)} \le e_i(A)e_i(B) \le \frac{e_i^2(AB)}{e_{\min}(A)e_{\min}(B)}.$$


Proof. See Anderson and Gupta (1963, Corollary 2.2.1). □ 

A special case of the double inequality in (2.10) is 

$$e_{\min}(A)e_{\min}(B) \le e_i(AB) \le e_{\max}(A)e_{\max}(B),$$

for all i (i = 1, 2, ..., n). 


Theorem 2.3.19. Let A and B be symmetric matrices of order n × n. 
Then, the following hold: 


1. $e_i(A) \le e_i(A + B)$, i = 1, 2, ..., n, if B is nonnegative definite. 
2. $e_i(A) < e_i(A + B)$, i = 1, 2, ..., n, if B is positive definite. 


Proof. See Bellman (1970, Theorem 3, page 117). □ 


Theorem 2.3.20 (Schur's Theorem). Let $A = (a_{ij})$ be a symmetric matrix 
of order n × n, and let $\|A\|_2$ denote its Euclidean norm, defined as 

$$\|A\|_2 = \Bigl(\sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}^2\Bigr)^{1/2}.$$

Then 

$$\sum_{i=1}^{n} e_i^2(A) = \|A\|_2^2.$$


Proof. See Lancaster (1969, Theorem 7.3.1). □ 

Since $\|A\|_2 \le n \max_{i,j} |a_{ij}|$, then from Theorem 2.3.20 we conclude that 

$$|e_{\max}(A)| \le n \max_{i,j} |a_{ij}|.$$


Theorem 2.3.21. Let A be a symmetric matrix of order n × n, and let m 
and s be defined as 

$$m = \frac{\mathrm{tr}(A)}{n}, \qquad s = \left(\frac{\mathrm{tr}(A^2)}{n} - m^2\right)^{1/2}.$$

Then 

$$m - s(n-1)^{1/2} \le e_{\min}(A) \le m - \frac{s}{(n-1)^{1/2}},$$

$$m + \frac{s}{(n-1)^{1/2}} \le e_{\max}(A) \le m + s(n-1)^{1/2},$$

$$e_{\max}(A) - e_{\min}(A) \le s(2n)^{1/2}.$$

Proof. See Wolkowicz and Styan (1980, Theorems 2.1 and 2.5). □ 


2.4. APPLICATIONS OF MATRICES IN STATISTICS 


The use of matrix algebra is quite prevalent in statistics. In fact, in the areas 
of experimental design, linear models, and multivariate analysis, matrix 
algebra is considered the most frequently used branch of mathematics. 
Applications of matrices in these areas are well documented in several books, 
for example, Basilevsky (1983), Graybill (1983), Magnus and Neudecker 
(1988), and Searle (1982). We shall therefore not attempt to duplicate the 
material given in these books. 
Let us consider the following applications: 


2.4.1. The Analysis of the Balanced Mixed Model 


In analysis of variance, a linear model associated with a given experimental 
situation is said to be balanced if the numbers of observations in the 
subclasses of the data are the same. For example, the two-way crossed-classi- 
fication model with interaction, 


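$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}, \tag{2.11}$$

i = 1, 2, ..., a; j = 1, 2, ..., b; k = 1, 2, ..., n, is balanced, since there are n 
observations for each combination of i and j. Here, $\alpha_i$ and $\beta_j$ represent the 
main effects of the factors under consideration, $(\alpha\beta)_{ij}$ denotes the interaction 
effect, and $\epsilon_{ijk}$ is a random error term. Model (2.11) can be written in 
vector form as 

$$y = H_0\tau_0 + H_1\tau_1 + H_2\tau_2 + H_3\tau_3 + H_4\tau_4, \tag{2.12}$$

where y is the vector of observations, $\tau_0 = \mu$, $\tau_1 = (\alpha_1, \alpha_2, \ldots, \alpha_a)'$, $\tau_2 = 
(\beta_1, \beta_2, \ldots, \beta_b)'$, $\tau_3 = [(\alpha\beta)_{11}, (\alpha\beta)_{12}, \ldots, (\alpha\beta)_{ab}]'$, and $\tau_4 = 
(\epsilon_{111}, \epsilon_{112}, \ldots, \epsilon_{abn})'$. The matrices $H_i$ (i = 0, 1, 2, 3, 4) can be expressed as 
direct products of the form 

$$H_0 = 1_a \otimes 1_b \otimes 1_n,$$
$$H_1 = I_a \otimes 1_b \otimes 1_n,$$
$$H_2 = 1_a \otimes I_b \otimes 1_n,$$
$$H_3 = I_a \otimes I_b \otimes 1_n,$$
$$H_4 = I_a \otimes I_b \otimes I_n.$$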
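
The direct-product form of the matrices in (2.12) can be checked against the scalar model (2.11). The following Python sketch is not part of the original text. It assumes that H_0 through H_4 are the direct products $1_a \otimes 1_b \otimes 1_n$, $I_a \otimes 1_b \otimes 1_n$, $1_a \otimes I_b \otimes 1_n$, $I_a \otimes I_b \otimes 1_n$, and $I_a \otimes I_b \otimes I_n$, that the observations are ordered with k varying fastest, and it uses made-up integer values for the effects.

```python
def kron(a, b):
    """Direct product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def ones(n):
    """Column vector 1_n of order n x 1."""
    return [[1] for _ in range(n)]


def eye(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def matvec(h, t):
    return [sum(x * y for x, y in zip(row, t)) for row in h]


a, b, n = 2, 3, 2

# H_0, ..., H_4 built as direct products (assumed forms, see lead-in).
H = [
    kron(kron(ones(a), ones(b)), ones(n)),
    kron(kron(eye(a), ones(b)), ones(n)),
    kron(kron(ones(a), eye(b)), ones(n)),
    kron(kron(eye(a), eye(b)), ones(n)),
    kron(kron(eye(a), eye(b)), eye(n)),
]

# Hypothetical effect values, used only to exercise the identity.
mu = 10
alpha = [1, -1]
beta = [2, 0, -2]
ab = [[3, 0, -3], [-3, 0, 3]]
eps = list(range(a * b * n))  # stand-in for the error terms

tau = [[mu], alpha, beta, [ab[i][j] for i in range(a) for j in range(b)], eps]

# y from the vector form (2.12).
terms = [matvec(Hl, tl) for Hl, tl in zip(H, tau)]
y = [sum(parts) for parts in zip(*terms)]

# y from the scalar model (2.11), with k varying fastest.
direct = [mu + alpha[i] + beta[j] + ab[i][j] + eps[(i * b + j) * n + k]
          for i in range(a) for j in range(b) for k in range(n)]

print(y == direct)                  # True
print([len(Hl[0]) for Hl in H])     # [1, 2, 3, 6, 12]
```

The column counts 1, a, b, ab, and abn match the lengths of the vectors $\tau_0, \ldots, \tau_4$.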
In general, any balanced linear model can be written in vector form as 

$$y = \sum_{l=0}^{\nu} H_l \tau_l, \tag{2.13}$$

where $H_l$ (l = 0, 1, ..., ν) is a direct product of identity matrices and vectors 
of ones (see Khuri, 1982). If $\tau_0, \tau_1, \ldots, \tau_\theta$ (θ < ν - 1) are fixed unknown 
parameter vectors (fixed effects), and $\tau_{\theta+1}, \tau_{\theta+2}, \ldots, \tau_\nu$ are random vectors 
(random effects), then model (2.13) is called a balanced mixed model. 
Furthermore, if we assume that the random effects are independent and have 
the normal distributions $N(0, \sigma_l^2 I_{c_l})$, where $c_l$ is the number of columns of 
$H_l$, l = θ + 1, θ + 2, ..., ν, then, because model (2.13) is balanced, its statistical 
analysis becomes very simple. Here, the $\sigma_l^2$'s are called the model's 
variance components. A balanced mixed model can be written as 

$$y = Xg + Zh, \tag{2.14}$$

where $Xg = \sum_{l=0}^{\theta} H_l\tau_l$ is the fixed portion of the model, and $Zh = \sum_{l=\theta+1}^{\nu} H_l\tau_l$ 
is its random portion. The variance-covariance matrix of y is given by 

$$\Sigma = \sum_{l=\theta+1}^{\nu} A_l \sigma_l^2,$$

where $A_l = H_lH_l'$ (l = θ + 1, θ + 2, ..., ν). Note that $A_lA_p = A_pA_l$ for all 
l ≠ p. Hence, the matrices $A_l$ can be diagonalized simultaneously (see Theorem 
2.3.15). 

If y'Ay is a quadratic form in y, then y'Ay is distributed as a noncentral 
chi-squared variate $\chi_m^{\prime 2}(\eta)$ if and only if AΣ is idempotent of rank m, where 
η is the noncentrality parameter and is given by η = g'X'AXg (see Searle, 
1971, Section 2.5). 

The total sum of squares, y'y, can be uniquely partitioned as 

$$y'y = \sum_{l=0}^{\nu} y'P_l y,$$

where the $P_l$'s are idempotent matrices such that $P_lP_s = 0$ for all l ≠ s (see 
Khuri, 1982). The quadratic form $y'P_ly$ (l = 0, 1, ..., ν) is positive semidefinite 
and represents the sum of squares for the lth effect in model (2.13). 


Theorem 2.4.1. Consider the balanced mixed model (2.14), where the 
random effects are assumed to be independently and normally distributed 
with zero means and variance-covariance matrices $\sigma_l^2 I_{c_l}$ (l = θ + 1, θ + 
2, ..., ν). Then we have the following: 


1. $y'P_0y, y'P_1y, \ldots, y'P_\nu y$ are statistically independent. 

2. $y'P_ly/\delta_l$ is distributed as a noncentral chi-squared variate with degrees 
of freedom equal to the rank of $P_l$ and noncentrality parameter given 
by $\eta_l = g'X'P_lXg/\delta_l$ for l = 0, 1, ..., θ, where $\delta_l$ is a particular linear 
combination of the variance components $\sigma_{\theta+1}^2, \sigma_{\theta+2}^2, \ldots, \sigma_\nu^2$. However, 
for l = θ + 1, θ + 2, ..., ν, that is, for the random effects, $y'P_ly/\delta_l$ is 
distributed as a central chi-squared variate with $m_l$ degrees of freedom, 
where $m_l = r(P_l)$. 


Proof. See Theorem 4.1 in Khuri (1982). □ 


Theorem 2.4.1 provides the basis for a complete analysis of any balanced 
mixed model, as it can be used to obtain exact tests for testing the signifi- 
cance of the fixed effects and the variance components. 

A linear function a'g of g in model (2.14) is estimable if there exists a 
linear function c'y of the observations such that E(c'y) = a'g. In Searle 
(1971, Section 5.4) it is shown that a'g is estimable if and only if a' belongs to 
the linear span of the rows of X. In Khuri (1984) we have the following 
theorem: 


Theorem 2.4.2. Consider the balanced mixed model in (2.14). Then we 
have the following: 


1. $r(P_lX) = r(P_l)$, l = 0, 1, ..., θ. 

2. $r(X) = \sum_{l=0}^{\theta} r(P_lX)$. 

3. $P_0Xg, P_1Xg, \ldots, P_\theta Xg$ are linearly independent and span the space of all 
estimable linear functions of g. 


Theorem 2.4.2 is useful in identifying a basis of estimable linear functions 
of the fixed effects in model (2.14). 


2.4.2. The Singular-Value Decomposition 


The singular-value decomposition of a matrix is far more useful, both in 
statistics and in matrix algebra, than is commonly realized. For example, it 
plays a significant role in regression analysis. Let us consider the linear 
model 

$$y = X\beta + \epsilon, \tag{2.15}$$

where y is a vector of n observations, X is an n × p (n ≥ p) matrix consisting 
of known constants, β is an unknown parameter vector, and ε is a random 
error vector. Using Theorem 2.3.11, the matrix X' can be expressed as 

$$X' = P[D : 0]Q', \tag{2.16}$$

where P and Q are orthogonal matrices of orders p × p and n × n, respectively, 
and D is a diagonal matrix of order p × p consisting of nonnegative 
diagonal elements. These are the singular values of X (or of X') and are the 
positive square roots of the eigenvalues of X'X. From (2.16) we get 

$$X = Q\begin{bmatrix} D \\ 0' \end{bmatrix}P'. \tag{2.17}$$


If the columns of X are linearly related, then they are said to be 
multicollinear. In this case, X has rank r (< p), and the columns of X belong 
to a vector subspace of dimension r. At least one of the eigenvalues of X’X, 
and hence at least one of the singular values of X, will be equal to zero. In 
practice, such exact multicollinearities rarely occur in statistical applications. 
Rather, the columns of X may be “nearly” linearly related. In this case, the 
rank of X is p, but some of the singular values of X will be “near zero.” We 
shall use the term multicollinearity in a broader sense to describe the latter 
situation. It is also common to use the term “ill conditioning” to refer to the 
same situation. 

The presence of multicollinearities in X can have adverse effects on the 
least-squares estimate, $\hat{\beta}$, of β in (2.15). This can be easily seen from the fact 
that $\hat{\beta} = (X'X)^{-1}X'y$ and $\mathrm{Var}(\hat{\beta}) = (X'X)^{-1}\sigma^2$, where $\sigma^2$ is the error 
variance. Large variances associated with the elements of $\hat{\beta}$ can therefore be 
expected when the columns of X are multicollinear. This causes $\hat{\beta}$ to become 
an unreliable estimate of β. For a detailed study of multicollinearity and its 
effects, see Belsley, Kuh, and Welsch (1980, Chapter 3), Montgomery and 
Peck (1982, Chapter 8), and Myers (1990, Chapter 3). 

The singular-value decomposition of X can provide useful information for 
detecting multicollinearity, as we shall now see. Let us suppose that the 
columns of X are multicollinear. Because of this, some of the singular values 
of X, say $p_2$ (< p) of them, will be "near zero." Let us partition D in (2.17) as 

$$D = \begin{bmatrix} D_1 & 0 \\ 0 & D_2 \end{bmatrix},$$

where $D_1$ and $D_2$ are of orders $p_1 \times p_1$ and $p_2 \times p_2$ ($p_1 = p - p_2$), respectively. 
The diagonal elements of $D_2$ consist of those singular values of X 
labeled as "near zero." Let us now write (2.17) as 

$$XP = Q\begin{bmatrix} D_1 & 0 \\ 0 & D_2 \\ 0 & 0 \end{bmatrix}. \tag{2.18}$$

Let us next partition P and Q as $P = [P_1 : P_2]$, $Q = [Q_1 : Q_2]$, where $P_1$ and 
$P_2$ have $p_1$ and $p_2$ columns, respectively, and $Q_1$ and $Q_2$ have $p_1$ and $n - p_1$ 
columns, respectively. From (2.18) we conclude that 

$$XP_1 = Q_1D_1, \tag{2.19}$$
$$XP_2 \approx 0, \tag{2.20}$$

where ≈ represents approximate equality. The matrix $XP_2$ is "near zero" 
because of the smallness of the diagonal elements of $D_2$. 

We note from (2.20) that each column of $P_2$ provides a "near"-linear 
relationship among the columns of X. If (2.20) were an exact equality, then 
the columns of $P_2$ would provide an orthonormal basis for the null space 
of X. 

We have mentioned that the presence of multicollinearity is indicated by 
the “smallness” of the singular values of X. The problem now is to determine 
what “small” is. For this purpose it is common in Statistics to use the 
condition number of X, denoted by $\kappa(X)$. By definition

$$\kappa(X) = \frac{\lambda_{\max}}{\lambda_{\min}},$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are, respectively, the largest and smallest singular
values of X. Since the singular values of X are the positive square roots of the
eigenvalues of $X'X$, then $\kappa(X)$ can also be written as

$$\kappa(X) = \sqrt{\frac{e_{\max}(X'X)}{e_{\min}(X'X)}}.$$

If $\kappa(X)$ is less than 10, then there is no serious problem with
multicollinearity. Values of $\kappa(X)$ between 10 and 30 indicate moderate to strong
multicollinearity, and if $\kappa > 30$, severe multicollinearity is implied.
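
As a small sketch of the two equivalent expressions for $\kappa(X)$, assuming NumPy, the code below computes the condition number from the singular values and from the eigenvalues of $X'X$. The function names `condition_number` and `classify` are hypothetical helpers; `classify` simply encodes the rule of thumb stated above.

```python
import numpy as np

def condition_number(X):
    """Ratio of the largest to the smallest singular value of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return s.max() / s.min()

def classify(kappa):
    """Rule of thumb from the text: <10, 10 to 30, >30."""
    if kappa < 10:
        return "no serious problem"
    if kappa <= 30:
        return "moderate to strong"
    return "severe"

# A small matrix whose singular values are 2 and 1.
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [0.0, 0.0]])
kappa = condition_number(X)

# Same quantity from the eigenvalues of X'X.
eig = np.linalg.eigvalsh(X.T @ X)
kappa_eig = np.sqrt(eig.max() / eig.min())

print(kappa, kappa_eig, classify(kappa))
```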

More detailed discussions concerning the use of the singular-value decom- 
position in regression can be found in Mandel (1982). See also Lowerre 
(1982). Good (1969) described several applications of this decomposition in 
statistics and in matrix algebra.


2.4.3. Extrema of Quadratic Forms


In many statistical problems there is a need to find the extremum (maximum 
or minimum) of a quadratic form or a ratio of quadratic forms. Let us, for 
example, consider the following problem: 

Let $X_1, X_2, \ldots, X_n$ be a collection of random vectors, all having the same
number of elements. Suppose that these vectors are independently and
identically distributed (i.i.d.) as $N(\mu, \Sigma)$, where both $\mu$ and $\Sigma$ are unknown.
Consider testing the hypothesis $H_0: \mu = \mu_0$ versus its alternative $H_a: \mu \neq \mu_0$,
where $\mu_0$ is some hypothesized value of $\mu$. We need to develop a test
statistic for testing $H_0$.

The multivariate hypothesis $H_0$ is true if and only if the univariate
hypotheses

$$H_0(\lambda): \lambda'\mu = \lambda'\mu_0$$

are true for all $\lambda \neq 0$. A test statistic for testing $H_0(\lambda)$ is the following:

$$t(\lambda) = \frac{\lambda'(\bar{X} - \mu_0)\sqrt{n}}{(\lambda' S \lambda)^{1/2}},$$

where $\bar{X} = \sum_{i=1}^{n} X_i / n$ and S is the sample variance-covariance matrix, which
is an unbiased estimator of $\Sigma$, and is given by

$$S = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})'.$$

Large values of $t^2(\lambda)$ indicate falsehood of $H_0(\lambda)$. Since $H_0$ is rejected if
and only if $H_0(\lambda)$ is rejected for at least one $\lambda$, then the condition to reject
$H_0$ at the $\alpha$-level is $\sup_{\lambda \neq 0}[t^2(\lambda)] > c_\alpha$, where $c_\alpha$ is the upper $100\alpha\%$ point
of the distribution of $\sup_{\lambda \neq 0}[t^2(\lambda)]$. But

$$\sup_{\lambda \neq 0}[t^2(\lambda)] = \sup_{\lambda \neq 0} \frac{n[\lambda'(\bar{X} - \mu_0)]^2}{\lambda' S \lambda}
= n \sup_{\lambda \neq 0} \frac{\lambda'(\bar{X} - \mu_0)(\bar{X} - \mu_0)'\lambda}{\lambda' S \lambda}
= n\, e_{\max}[S^{-1}(\bar{X} - \mu_0)(\bar{X} - \mu_0)'],$$

by Theorem 2.3.17. Now,

$$e_{\max}[S^{-1}(\bar{X} - \mu_0)(\bar{X} - \mu_0)'] = e_{\max}[(\bar{X} - \mu_0)'S^{-1}(\bar{X} - \mu_0)]
= (\bar{X} - \mu_0)'S^{-1}(\bar{X} - \mu_0),$$

by Theorem 2.3.9. Hence,

$$\sup_{\lambda \neq 0}[t^2(\lambda)] = n(\bar{X} - \mu_0)'S^{-1}(\bar{X} - \mu_0)$$

is the test statistic for the multivariate hypothesis $H_0$. This is called Hotelling's
$T^2$-statistic. Its critical values are obtained in terms of the critical values of
the F-distribution (see, for example, Morrison, 1967, Chapter 4).
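
The identity between the supremum of $t^2(\lambda)$ and Hotelling's statistic can be checked numerically. The sketch below assumes NumPy; the simulated sample, the choice $\mu_0 = 0$, and the helper name `t_squared` are illustrative assumptions. It uses the fact that the supremum of this ratio is attained at $\lambda = S^{-1}(\bar{X} - \mu_0)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
mu0 = np.zeros(p)                       # hypothesized mean (illustrative)
X = rng.normal(size=(n, p)) + 0.3       # simulated sample, one vector per row

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)             # sample covariance, divisor n - 1
d = xbar - mu0

# Hotelling's T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0)
T2 = n * d @ np.linalg.solve(S, d)

def t_squared(lam):
    """Squared univariate statistic t^2(lambda) for a direction lam."""
    return n * (lam @ d) ** 2 / (lam @ S @ lam)

lam_star = np.linalg.solve(S, d)        # direction attaining the supremum
random_vals = [t_squared(rng.normal(size=p)) for _ in range(1000)]

print("T^2:", T2)
print("t^2 at the maximizing direction:", t_squared(lam_star))
print("largest t^2 over random directions:", max(random_vals))
```

No random direction exceeds $T^2$, and the maximizing direction reproduces it up to rounding.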

Another example of using the extremum of a ratio of quadratic forms is in 
the determination of the canonical correlation coefficient between two ran- 
dom vectors (see Exercise 2.26). The article by Bush and Olkin (1959) lists 
several similar statistical applications. 


2.4.4. The Parameterization of Orthogonal Matrices 


Orthogonal matrices are used frequently in statistics, especially in linear 
models and multivariate analysis (see, for example, Graybill, 1961, Chapter 
11; James, 1954). 

The $n^2$ elements of an $n \times n$ orthogonal matrix Q are subject to $n(n + 1)/2$
constraints because $Q'Q = I_n$. These elements can therefore be represented
by $n^2 - n(n + 1)/2 = n(n - 1)/2$ independent parameters. The need for such
a representation arises in several situations. For example, in the design of 
experiments, there may be a need to search for an orthogonal matrix that 
satisfies a certain optimality criterion. Using the independent parameters of 
an orthogonal matrix can facilitate this search. Khuri and Myers (1981) 
followed this approach in their construction of a response surface design that 
is robust to nonnormality of the error distribution associated with the 
response function. Another example is the generation of random orthogonal 
matrices for carrying out simulation experiments. This was used by Heiberger, 
Velleman, and Ypelaar (1983) to construct test data with special properties 
for multivariate linear models. Anderson, Olkin, and Underhill (1987) pro- 
posed a procedure to generate random orthogonal matrices. 

Methods to parameterize an orthogonal matrix were reviewed in Khuri 
and Good (1989). One such method is to use the relationship between an 
orthogonal matrix and a skew-symmetric matrix. If Q is an orthogonal matrix 
with determinant equal to one, then it can be written in the form 


$$Q = e^{T},$$


where T is a skew-symmetric matrix (see, for example, Gantmacher, 1959). 
The elements of T above its main diagonal can be used to parameterize Q. 
This exponential mapping is defined by the infinite series 


$$e^{T} = I + T + \frac{T^2}{2!} + \frac{T^3}{3!} + \cdots.$$

The exponential parameterization was used in a theorem concerning the
asymptotic joint density function of the eigenvalues of the sample
variance-covariance matrix (Muirhead, 1982, page 394).
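
A minimal sketch of the exponential parameterization, assuming NumPy: the $n(n-1)/2$ free parameters fill the upper triangle of a skew-symmetric T, and the series for $e^T$ is truncated after a fixed number of terms. The helper names `skew_from_params` and `expm_series`, the truncation length, and the parameter values are illustrative assumptions; a truncated series is adequate only for matrices of modest norm.

```python
import numpy as np

def skew_from_params(theta, n):
    """Place n(n-1)/2 parameters above the diagonal of a skew-symmetric T."""
    T = np.zeros((n, n))
    T[np.triu_indices(n, k=1)] = theta
    return T - T.T

def expm_series(T, terms=30):
    """Truncated series I + T + T^2/2! + T^3/3! + ..."""
    result = np.eye(T.shape[0])
    term = np.eye(T.shape[0])
    for k in range(1, terms):
        term = term @ T / k          # term now holds T^k / k!
        result = result + term
    return result

theta = np.array([0.3, -0.5, 0.8])   # 3 = n(n-1)/2 parameters for n = 3
T = skew_from_params(theta, 3)
Q = expm_series(T)

print(np.round(Q.T @ Q, 10))         # should be the identity
print(np.linalg.det(Q))              # should be 1
```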

Another parameterization of Q is given by 


$$Q = (I - U)(I + U)^{-1},$$


where U is a skew-symmetric matrix. This relationship is valid provided that
Q does not have the eigenvalue $-1$. Otherwise, Q can be written as


$$Q = L(I - U)(I + U)^{-1},$$


where L is a diagonal matrix in which each element on the diagonal is either
1 or $-1$. Arthur Cayley (1821-1895) is credited with having introduced the
relationship between Q and U.
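
The Cayley relationship can be sketched in the same way, assuming NumPy. The particular skew-symmetric U below is an arbitrary illustration. The code also inverts the map, using $U = (I - Q)(I + Q)^{-1}$, which follows from the displayed formula when Q does not have the eigenvalue $-1$.

```python
import numpy as np

# An arbitrary skew-symmetric matrix (U' = -U), chosen for illustration.
U = np.array([[0.0, 0.4, -0.2],
              [-0.4, 0.0, 0.7],
              [0.2, -0.7, 0.0]])
I = np.eye(3)

# Cayley parameterization Q = (I - U)(I + U)^{-1}
Q = (I - U) @ np.linalg.inv(I + U)

# Inverse map, valid because Q has no eigenvalue equal to -1.
U_back = (I - Q) @ np.linalg.inv(I + Q)

print(np.round(Q.T @ Q, 10))      # identity, so Q is orthogonal
print(np.round(U_back - U, 10))   # zero matrix, so U is recovered
```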

Finally, the recent article by Olkin (1990) illustrates the strong interplay
between statistics and linear algebra. The author listed several areas of
statistics with a strong linear algebra component.


EXERCISES

In Mathematics


2.1. Show that a set of $n \times 1$ vectors, $u_1, u_2, \ldots, u_m$, is always linearly
dependent if $m > n$.


2.2. Let W be a vector subspace of V such that $W = L(u_1, u_2, \ldots, u_n)$,
where the $u_i$'s ($i = 1, 2, \ldots, n$) are linearly independent. If v is any
vector in V that is not in W, then the vectors $u_1, u_2, \ldots, u_n, v$ are
linearly independent.


2.3. Prove Theorem 2.1.3.

2.4. Prove part 1 of Theorem 2.2.1.

2.5. Let $T: U \to V$ be a linear transformation. Show that T is one-to-one if
and only if whenever $u_1, u_2, \ldots, u_n$ are linearly independent in U, then
$T(u_1), T(u_2), \ldots, T(u_n)$ are linearly independent in V.


2.6. Let $T: R^n \to R^m$ be represented by an $n \times m$ matrix of rank $\rho$.
(a) Show that $\dim[T(R^n)] = \rho$.
(b) Show that if $n \le m$ and $\rho = n$, then T is one-to-one.


2.7. Show that $\mathrm{tr}(A'A) = 0$ if and only if $A = 0$.


2.8. Let A be a symmetric positive semidefinite matrix of order $n \times n$. Show
that $v'Av = 0$ if and only if $Av = 0$.


2.9. The matrices A and B are symmetric and positive semidefinite of order
$n \times n$ such that $AB = BA$. Show that AB is positive semidefinite.


2.10. If A is a symmetric $n \times n$ matrix, and B is an $n \times n$ skew-symmetric
matrix, then show that $\mathrm{tr}(AB) = 0$.


2.11. Suppose that $\mathrm{tr}(PA) = 0$ for every skew-symmetric matrix P. Show that
the matrix A is symmetric.


2.12. Let A be an $n \times n$ matrix and C be a nonsingular matrix of order $n \times n$.
Show that A, $C^{-1}AC$, and $CAC^{-1}$ have the same set of eigenvalues.


2.13. Let A be an $n \times n$ symmetric matrix, and let $\lambda$ be an eigenvalue of A of
multiplicity k. Then $A - \lambda I_n$ has rank $n - k$.


2.14. Let A be a nonsingular matrix of order $n \times n$, and let c and d be $n \times 1$
vectors. If $d'A^{-1}c \neq -1$, then

$$(A + cd')^{-1} = A^{-1} - \frac{(A^{-1}c)(d'A^{-1})}{1 + d'A^{-1}c}.$$

This is known as the Sherman-Morrison formula.


2.15. Show that if A and $I_k + V'A^{-1}U$ are nonsingular, then

$$(A + UV')^{-1} = A^{-1} - A^{-1}U(I_k + V'A^{-1}U)^{-1}V'A^{-1},$$

where A is of order $n \times n$, and U and V are of order $n \times k$. This result
is known as the Sherman-Morrison-Woodbury formula and is a generalization
of the result in Exercise 2.14.


2.16. Prove Theorem 2.3.17.


2.17. Let A and B be $n \times n$ idempotent matrices. Show that $A - B$ is
idempotent if and only if $AB = BA = B$.


2.18. Let A be an orthogonal matrix. What can be said about the eigenvalues
of A?


2.19. Let A be a symmetric matrix of order $n \times n$, and let L be a matrix of
order $n \times m$. Show that

$$e_{\min}(A)\,\mathrm{tr}(L'L) \le \mathrm{tr}(L'AL) \le e_{\max}(A)\,\mathrm{tr}(L'L).$$


2.20. Let A be a nonnegative definite matrix of order $n \times n$, and let L be a
matrix of order $n \times m$. Show that
(a) $e_{\min}(L'AL) \ge e_{\min}(A)\,e_{\min}(L'L)$,
(b) $e_{\max}(L'AL) \le e_{\max}(A)\,e_{\max}(L'L)$.


2.21. Let A and B be $n \times n$ symmetric matrices with A nonnegative definite.
Show that

$$e_{\min}(B)\,\mathrm{tr}(A) \le \mathrm{tr}(AB) \le e_{\max}(B)\,\mathrm{tr}(A).$$


2.22. Let $A^{-}$ be a g-inverse of A. Show that
(a) $A^{-}A$ is idempotent,
(b) $r(A^{-}) \ge r(A)$,
(c) $r(A) = r(A^{-}A)$.


In Statistics 


2.23. Let $y = (y_1, y_2, \ldots, y_n)'$ be a normal random vector $N(0, \sigma^2 I_n)$. Let $\bar{y}$
and $s^2$ be the sample mean and sample variance given by

$$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i,$$

$$s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \right].$$

(a) Show that A is an idempotent matrix of rank $n - 1$, where A is an
$n \times n$ matrix such that $y'Ay = (n - 1)s^2$.
(b) What distribution does $(n - 1)s^2/\sigma^2$ have?
(c) Show that $\bar{y}$ and $Ay$ are uncorrelated; then conclude that $\bar{y}$ and $s^2$
are statistically independent.


2.24. Consider the one-way classification model

$$y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \qquad i = 1, 2, \ldots, a; \quad j = 1, 2, \ldots, n_i,$$

where $\mu$ and $\alpha_i$ ($i = 1, 2, \ldots, a$) are unknown parameters and $\epsilon_{ij}$ is a
random error with a zero mean. Show that
(a) $\alpha_i - \alpha_{i'}$ is an estimable linear function for all $i \neq i'$ ($i, i' = 1, 2, \ldots, a$),
(b) $\mu$ is nonestimable.


2.25. Consider the linear model

$$y = X\beta + \epsilon,$$

where X is a known matrix of order $n \times p$ and rank r ($\le p$), $\beta$ is an
unknown parameter vector, and $\epsilon$ is a random error vector such that
$E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2 I_n$.
(a) Show that $X(X'X)^{-}X'$ is an idempotent matrix.
(b) Let $l'y$ be an unbiased linear estimator of $\lambda'\beta$. Show that

$$\mathrm{Var}(\lambda'\hat{\beta}) \le \mathrm{Var}(l'y),$$

where $\lambda'\hat{\beta} = \lambda'(X'X)^{-}X'y$.
The result given in part (b) is known as the Gauss-Markov theorem.


2.26. Consider the linear model in Exercise 2.25, and suppose that $r(X) = p$.
Hoerl and Kennard (1970) introduced an estimator of $\beta$ called the
ridge estimator $\beta^*$:

$$\beta^* = (X'X + kI_p)^{-1}X'y,$$

where k is a "small" fixed number. For an appropriate value of k, $\beta^*$
provides improved accuracy in the estimation of $\beta$ over the least-squares
estimator $\hat{\beta} = (X'X)^{-1}X'y$. Let $X'X = P\Lambda P'$ be the spectral decomposition
of $X'X$. Show that $\beta^* = PDP'\hat{\beta}$, where D is a diagonal matrix
whose ith diagonal element is $\lambda_i/(\lambda_i + k)$, $i = 1, 2, \ldots, p$, and where
$\lambda_1, \lambda_2, \ldots, \lambda_p$ are the diagonal elements of $\Lambda$.


2.27. Consider the ratio

$$\rho^2 = \frac{(x'Ay)^2}{(x'B_1x)(y'B_2y)},$$

where A is a matrix of order $m \times n$ and $B_1$, $B_2$ are positive definite of
orders $m \times m$ and $n \times n$, respectively. Show that

$$\sup_{x,y} \rho^2 = e_{\max}(B_1^{-1}AB_2^{-1}A').$$

[Hint: Define $C_1$ and $C_2$ as symmetric nonsingular matrices such that
$C_1^2 = B_1$, $C_2^2 = B_2$. Let $C_1x = u$, $C_2y = v$. Then $\rho^2$ can be written as

$$\rho^2 = \frac{(u'C_1^{-1}AC_2^{-1}v)^2}{(u'u)(v'v)} = (\nu'C_1^{-1}AC_2^{-1}\tau)^2,$$

where $\nu = u/(u'u)^{1/2}$, $\tau = v/(v'v)^{1/2}$ are unit vectors. Verify the result
of this problem after noting that $\rho^2$ is now the square of a dot
product.]


Note: This exercise has the following application in multivariate analysis:
Let $z_1$ and $z_2$ be random vectors with zero means and variance-covariance
matrices $\Sigma_{11}$, $\Sigma_{22}$, respectively. Let $\Sigma_{12}$ be the covariance
matrix of $z_1$ and $z_2$. On choosing $A = \Sigma_{12}$, $B_1 = \Sigma_{11}$, $B_2 = \Sigma_{22}$, the
positive square root of the supremum of $\rho^2$ is called the canonical
correlation coefficient between $z_1$ and $z_2$. It is a measure of the linear
association between $z_1$ and $z_2$ (see, for example, Seber, 1984, Section
5.7).


CHAPTER 3 


Limits and Continuity of Functions 


The notions of limits and continuity of functions lie at the kernel of calculus. 
The general concept of continuity is very old in mathematics. It had its 
inception long ago in ancient Greece. We owe to Aristotle (384-322 B.C.) 
the first known definition of continuity: “A thing is continuous when of any 
two successive parts the limits at which they touch are one and the same and 
are, as the word implies, held together” (see Smith, 1958, page 93). Our 
present definitions of limits and continuity of functions, however, are sub- 
stantially those given by Augustin-Louis Cauchy (1789-1857). 

In this chapter we introduce the concepts of limits and continuity of 
real-valued functions, and study some of their properties. The domains of 
definition of the functions will be subsets of R, the set of real numbers. 
A typical subset of R will be denoted by D. 


3.1. LIMITS OF A FUNCTION 


Before defining the notion of a limit of a function, let us understand what is
meant by the notation $x \to a$, where a and x are elements in R. If a is finite,
then $x \to a$ means that x can have values that belong to a neighborhood
$N_r(a)$ of a (see Definition 1.6.1) for any $r > 0$, but $x \neq a$, that is, $0 < |x - a| < r$.
Such a neighborhood is called a deleted neighborhood of a, that is, a
neighborhood from which the point a has been removed. If a is infinite ($-\infty$
or $+\infty$), then $x \to a$ indicates that $|x|$ can get larger and larger without any
constraint on the extent of its increase. Thus $|x|$ can have values greater than
any positive number. In either case, whether a is finite or infinite, we say that
x tends to a or approaches a.
Let us now study the behavior of a function f(x) as $x \to a$.


Definition 3.1.1. Suppose that the function f(x) is defined in a deleted
neighborhood of a point $a \in R$. Then f(x) is said to have a limit L as $x \to a$
if for every $\epsilon > 0$ there exists a $\delta > 0$ such that

$$|f(x) - L| < \epsilon \tag{3.1}$$

for all x for which

$$0 < |x - a| < \delta. \tag{3.2}$$

In this case, we write $f(x) \to L$ as $x \to a$, which is equivalent to saying that
$\lim_{x \to a} f(x) = L$. Less formally, we say that $f(x) \to L$ as $x \to a$ if, however
small the positive number $\epsilon$ might be, f(x) differs from L by less than $\epsilon$ for
values of x sufficiently close to a. □


Note 3.1.1. When f(x) has a limit as $x \to a$, it is considered to be finite.
If this is not the case, then f(x) is said to have an infinite limit ($-\infty$ or $+\infty$)
as $x \to a$. This limit exists only in the extended real number system, which
consists of the real number system combined with the two symbols $-\infty$ and
$+\infty$. In this case, for every positive number M there exists a $\delta > 0$ such that
$|f(x)| > M$ if $0 < |x - a| < \delta$. If a is infinite and L is finite, then $f(x) \to L$ as
$x \to a$ if for any $\epsilon > 0$ there exists a positive number N such that inequality
(3.1) is satisfied for all x for which $|x| > N$. In case both a and L are infinite,
then $f(x) \to L$ as $x \to a$ if for any $B > 0$ there exists a positive number A
such that $|f(x)| > B$ if $|x| > A$.


Note 3.1.2. If f(x) has a limit L as $x \to a$, then L must be unique. To
show this, suppose that $L_1$ and $L_2$ are two limits of f(x) as $x \to a$. Then, for
any $\epsilon > 0$ there exist $\delta_1 > 0$, $\delta_2 > 0$ such that

$$|f(x) - L_1| < \frac{\epsilon}{2}, \quad \text{if } 0 < |x - a| < \delta_1,$$

$$|f(x) - L_2| < \frac{\epsilon}{2}, \quad \text{if } 0 < |x - a| < \delta_2.$$

Hence, if $\delta = \min(\delta_1, \delta_2)$, then

$$|L_1 - L_2| = |L_1 - f(x) + f(x) - L_2| \le |f(x) - L_1| + |f(x) - L_2| < \epsilon$$

for all x for which $0 < |x - a| < \delta$. Since $|L_1 - L_2|$ is smaller than $\epsilon$, which is
an arbitrary positive number, we must have $L_1 = L_2$ (why?).


Note 3.1.3. The limit of f(x) as described in Definition 3.1.1 is actually
called a two-sided limit. This is because x can approach a from either side.
There are, however, cases where f(x) can have a limit only when x approaches
a from one side. Such a limit is called a one-sided limit.

By definition, if f(x) has a limit as x approaches a from the left,
symbolically written as $x \to a^-$, then f(x) has a left-sided limit, which we
denote by $L^-$. In this case we write

$$\lim_{x \to a^-} f(x) = L^-.$$

If, however, f(x) has a limit as x approaches a from the right, symbolically
written as $x \to a^+$, then f(x) has a right-sided limit, denoted by $L^+$, that is,

$$\lim_{x \to a^+} f(x) = L^+.$$

From the above definition it follows that f(x) has a left-sided limit $L^-$ as
$x \to a^-$ if for every $\epsilon > 0$ there exists a $\delta > 0$ such that

$$|f(x) - L^-| < \epsilon$$

for all x for which $0 < a - x < \delta$. Similarly, f(x) has a right-sided limit $L^+$ as
$x \to a^+$ if for every $\epsilon > 0$ there exists a $\delta > 0$ such that

$$|f(x) - L^+| < \epsilon$$

for all x for which $0 < x - a < \delta$.

Obviously, if f(x) has a two-sided limit L as $x \to a$, then $L^-$ and $L^+$ both
exist and are equal to L. Vice versa, if $L^- = L^+$, then f(x) has a two-sided
limit L as $x \to a$, where L is the common value of $L^-$ and $L^+$ (why?). We
can then state that $\lim_{x \to a} f(x) = L$ if and only if

$$\lim_{x \to a^-} f(x) = \lim_{x \to a^+} f(x) = L.$$

Thus to determine if f(x) has a limit as $x \to a$, we first need to find out if it
has a left-sided limit $L^-$ and a right-sided limit $L^+$ as $x \to a$. If this is the
case and $L^- = L^+ = L$, then f(x) has a limit L as $x \to a$.

Throughout the remainder of the book, we shall drop the characterization 
“two-sided” when making a reference to a two-sided limit L of f(x). Instead, 
we shall simply state that L is the limit of f(x). 


EXAMPLE 3.1.1. Consider the function

$$f(x) = \begin{cases} (x - 1)/(x^2 - 1), & x \neq -1, 1, \\ 5, & x = 1. \end{cases}$$

This function is defined everywhere except at $x = -1$. Let us find its limit as
$x \to a$, where $a \in R$. We note that

$$\lim_{x \to a} f(x) = \lim_{x \to a} \frac{x - 1}{x^2 - 1} = \lim_{x \to a} \frac{1}{x + 1}.$$

This is true even if $a = 1$, because $x \neq a$ as $x \to a$. We now claim that if
$a \neq -1$, then

$$\lim_{x \to a} \frac{1}{x + 1} = \frac{1}{a + 1}.$$

To prove this claim, we need to find, for any $\epsilon > 0$, a $\delta > 0$ such that

$$\left| \frac{1}{x + 1} - \frac{1}{a + 1} \right| < \epsilon \tag{3.3}$$

if $0 < |x - a| < \delta$. Let us therefore consider the following two cases:

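CASE 1. $a > -1$. In this case we have

$$\left| \frac{1}{x + 1} - \frac{1}{a + 1} \right| = \frac{|x - a|}{|x + 1|\,|a + 1|}. \tag{3.4}$$

If $|x - a| < \delta$, then

$$a - \delta + 1 < x + 1 < a + \delta + 1. \tag{3.5}$$

Since $a + 1 > 0$, we can choose $\delta > 0$ such that $a - \delta + 1 > 0$, that is,
$\delta < a + 1$. From (3.4) and (3.5) we then get

$$\left| \frac{1}{x + 1} - \frac{1}{a + 1} \right| < \frac{\delta}{(a + 1)(a - \delta + 1)}.$$

Let us constrain $\delta$ even further by requiring that

$$\frac{\delta}{(a + 1)(a - \delta + 1)} < \epsilon.$$

This is accomplished by choosing $\delta > 0$ so that

$$\delta < \frac{(a + 1)^2 \epsilon}{1 + (a + 1)\epsilon}.$$

Since

$$\frac{(a + 1)^2 \epsilon}{1 + (a + 1)\epsilon} < a + 1,$$

inequality (3.3) will be satisfied by all x for which $|x - a| < \delta$, where

$$0 < \delta < \frac{(a + 1)^2 \epsilon}{1 + (a + 1)\epsilon}. \tag{3.6}$$

CASE 2. $a < -1$. Here, we choose $\delta > 0$ such that $a + \delta + 1 < 0$, that is,
$\delta < -(a + 1)$. From (3.5) we conclude that

$$|x + 1| > -(a + \delta + 1).$$

Hence, from (3.4) we get

$$\left| \frac{1}{x + 1} - \frac{1}{a + 1} \right| < \frac{\delta}{(a + 1)(a + \delta + 1)}.$$

As before, we further constrain $\delta$ by requiring that it satisfy the inequality

$$\frac{\delta}{(a + 1)(a + \delta + 1)} < \epsilon,$$

or equivalently, the inequality

$$\delta < \frac{(a + 1)^2 \epsilon}{1 - (a + 1)\epsilon}.$$

Note that

$$\frac{(a + 1)^2 \epsilon}{1 - (a + 1)\epsilon} < -(a + 1).$$

Consequently, inequality (3.3) can be satisfied by choosing $\delta$ such that

$$0 < \delta < \frac{(a + 1)^2 \epsilon}{1 - (a + 1)\epsilon}. \tag{3.7}$$

Cases 1 and 2 can be combined by rewriting (3.6) and (3.7) using the single
double inequality

$$0 < \delta < \frac{|a + 1|^2 \epsilon}{1 + |a + 1|\epsilon}.$$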

If $a = -1$, then no limit exists as $x \to a$. This is because

$$\lim_{x \to -1} f(x) = \lim_{x \to -1} \frac{1}{x + 1}.$$

If $x \to -1^-$, then

$$\lim_{x \to -1^-} \frac{1}{x + 1} = -\infty,$$

and, as $x \to -1^+$,

$$\lim_{x \to -1^+} \frac{1}{x + 1} = \infty.$$

Since the left-sided and right-sided limits are not equal, no limit exists as
$x \to -1$.


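EXAMPLE 3.1.2. Let f(x) be defined as

$$f(x) = \begin{cases} 1 + \sqrt{x}, & x \ge 0, \\ x, & x < 0. \end{cases}$$

This function has no limit as $x \to 0$, since

$$\lim_{x \to 0^-} f(x) = \lim_{x \to 0^-} x = 0,$$

$$\lim_{x \to 0^+} f(x) = \lim_{x \to 0^+} (1 + \sqrt{x}) = 1.$$

However, for any $a \neq 0$, $\lim_{x \to a} f(x)$ exists.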


EXAMPLE 3.1.3. Let f(x) be given by 
_fxcosx, x#0, 
FA x=0. 


Then $\lim_{x \to 0} f(x) = 0$. This is true because $|f(x)| \le |x|$ in any deleted
neighborhood of $a = 0$. As $x \to \infty$, f(x) oscillates unboundedly, since

$$-x \le x \cos x \le x.$$

Thus f(x) has no limit as $x \to \infty$. A similar conclusion can be reached when
$x \to -\infty$ (see Figure 3.1).

Figure 3.1. The graph of the function f(x).


3.2. SOME PROPERTIES ASSOCIATED WITH LIMITS OF FUNCTIONS 


The following theorems give some fundamental properties associated with 
function limits. 


Theorem 3.2.1. Let f(x) and g(x) be real-valued functions defined on
$D \subset R$. Suppose that $\lim_{x \to a} f(x) = L$ and $\lim_{x \to a} g(x) = M$. Then

1. $\lim_{x \to a} [f(x) + g(x)] = L + M$,

2. $\lim_{x \to a} [f(x)g(x)] = LM$,

3. $\lim_{x \to a} [1/g(x)] = 1/M$ if $M \neq 0$,

4. $\lim_{x \to a} [f(x)/g(x)] = L/M$ if $M \neq 0$.


Proof. We shall only prove parts 2 and 3. The proof of part 1 is straight- 
forward, and part 4 results from applying parts 2 and 3. 


Proof of Part 2. Consider the following three cases: 


Case 1. Both L and M are finite. Let $\epsilon > 0$ be given and let $\tau > 0$ be
such that

$$\tau(\tau + |L| + |M|) < \epsilon. \tag{3.8}$$

This inequality is satisfied by all values of $\tau$ for which

$$0 < \tau < \frac{-(|L| + |M|) + \sqrt{(|L| + |M|)^2 + 4\epsilon}}{2}.$$

Now, there exist $\delta_1 > 0$, $\delta_2 > 0$ such that

$$|f(x) - L| < \tau \quad \text{if } 0 < |x - a| < \delta_1,$$

$$|g(x) - M| < \tau \quad \text{if } 0 < |x - a| < \delta_2.$$

Then, for any x such that $0 < |x - a| < \delta$, where $\delta = \min(\delta_1, \delta_2)$,

$$\begin{aligned}
|f(x)g(x) - LM| &= |M[f(x) - L] + f(x)[g(x) - M]| \\
&\le \tau|M| + \tau|f(x)| \\
&\le \tau|M| + \tau[|L| + |f(x) - L|] \\
&\le \tau(\tau + |L| + |M|) \\
&< \epsilon,
\end{aligned}$$

which proves part 2.


Case 2. One of L and M is finite and the other is infinite. Without any
loss of generality, we assume that L is finite and $M = \infty$. Let us also assume
that $L \neq 0$, since $0 \cdot \infty$ is indeterminate. Let $A > 0$ be given. There exists a
$\delta_1 > 0$ such that $|f(x)| > |L|/2$ if $0 < |x - a| < \delta_1$ (why?). Also, there exists
a $\delta_2 > 0$ such that $|g(x)| > 2A/|L|$ if $0 < |x - a| < \delta_2$. Let $\delta = \min(\delta_1, \delta_2)$.
If $0 < |x - a| < \delta$, then

$$|f(x)g(x)| = |f(x)|\,|g(x)| > \frac{|L|}{2} \cdot \frac{2A}{|L|} = A.$$

This means that $\lim_{x \to a} f(x)g(x) = \infty$, which proves part 2.


Case 3. Both L and M are infinite. Suppose that $L = \infty$, $M = \infty$. In this
case, for a given $B > 0$ there exist $\kappa_1 > 0$, $\kappa_2 > 0$ such that

$$|f(x)| > \sqrt{B} \quad \text{if } 0 < |x - a| < \kappa_1,$$

$$|g(x)| > \sqrt{B} \quad \text{if } 0 < |x - a| < \kappa_2.$$

Then,

$$|f(x)g(x)| > B, \quad \text{if } 0 < |x - a| < \kappa,$$

where $\kappa = \min(\kappa_1, \kappa_2)$. This implies that $\lim_{x \to a} f(x)g(x) = \infty$, which proves
part 2.


Proof of Part 3. Let $\epsilon > 0$ be given. If $M \neq 0$, then there exists a $\lambda_1 > 0$
such that $|g(x)| > |M|/2$ if $0 < |x - a| < \lambda_1$. Also, there exists a $\lambda_2 > 0$ such
that $|g(x) - M| < \epsilon M^2/2$ if $0 < |x - a| < \lambda_2$. Then,

$$\left| \frac{1}{g(x)} - \frac{1}{M} \right| = \frac{|g(x) - M|}{|g(x)|\,|M|} < \frac{2|g(x) - M|}{|M|^2} < \epsilon,$$

if $0 < |x - a| < \lambda$, where $\lambda = \min(\lambda_1, \lambda_2)$. □

Theorem 3.2.1 is also true if L and M are one-sided limits of f(x) and
g(x), respectively.


Theorem 3.2.2. If $f(x) \le g(x)$, then $\lim_{x \to a} f(x) \le \lim_{x \to a} g(x)$.

Proof. Let $\lim_{x \to a} f(x) = L$, $\lim_{x \to a} g(x) = M$. Suppose that $L - M > 0$.
By Theorem 3.2.1, $L - M$ is the limit of the function $h(x) = f(x) - g(x)$.
Therefore, there exists a $\delta > 0$ such that

$$|h(x) - (L - M)| < \frac{L - M}{2}, \tag{3.9}$$

if $0 < |x - a| < \delta$. Inequality (3.9) implies that $h(x) > (L - M)/2 > 0$, which
is not possible, since, by assumption, $h(x) = f(x) - g(x) \le 0$. We must then
have $L - M \le 0$. □


3.3. THE o, O NOTATION


These symbols provide a convenient way to describe the limiting behavior of 
a function f(x) as x tends to a certain limit. 

Let f(x) and g(x) be two functions defined on $D \subset R$. The function g(x)
is positive and usually has a simple form such as 1, x, or 1/x. Suppose that
there exists a positive number K such that

$$|f(x)| \le K g(x)$$

for all $x \in E$, where $E \subset D$. Then, f(x) is said to be of an order of magnitude
not exceeding that of g(x). This fact is denoted by writing

$$f(x) = O(g(x))$$

for all $x \in E$. In particular, if $g(x) = 1$, then f(x) is necessarily a bounded
function on E. For example,

$$\cos x = O(1) \quad \text{for all } x,$$

$$x = O(x^2) \quad \text{for large values of } x,$$

$$x^2 + 2x = O(x^2) \quad \text{for large values of } x,$$

$$\sin x = O(|x|) \quad \text{for all } x.$$


The last relationship is true because

$$\left| \frac{\sin x}{x} \right| \le 1$$

for all values of x, where x is measured in radians.

Let us now suppose that the relationship between f(x) and g(x) is such
that

$$\lim_{x \to a} \frac{f(x)}{g(x)} = 0.$$

Then we say that f(x) is of a smaller order of magnitude than g(x) in a
deleted neighborhood of a. This fact is denoted by writing

$$f(x) = o(g(x)) \quad \text{as } x \to a,$$

which is equivalent to saying that f(x) tends to zero more rapidly than g(x)
as $x \to a$. The o symbol can also be used when x tends to infinity. In this
case we write

$$f(x) = o(g(x)) \quad \text{for } x > A,$$

where A is some positive number. For example,

$$x^2 = o(x) \quad \text{as } x \to 0,$$

$$\tan x^3 = o(x^2) \quad \text{as } x \to 0,$$

$$\sqrt{x} = o(x) \quad \text{as } x \to \infty.$$


If f(x) and g(x) are any two functions such that

$$\frac{f(x)}{g(x)} \to 1 \quad \text{as } x \to a,$$

then f(x) and g(x) are said to be asymptotically equal, written symbolically
$f(x) \sim g(x)$, as $x \to a$. For example,

$$x^2 \sim x^2 + 3x + 1 \quad \text{as } x \to \infty,$$

$$\sin x \sim x \quad \text{as } x \to 0.$$

On the basis of the above definitions, the following properties can be
deduced:

1. $O(f(x) + g(x)) = O(f(x)) + O(g(x))$.

2. $O(f(x)g(x)) = O(f(x))\,O(g(x))$.

3. $o(f(x)g(x)) = O(f(x))\,o(g(x))$.

4. If $f(x) \sim g(x)$ as $x \to a$, then $f(x) = g(x) + o(g(x))$ as $x \to a$.
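
The order relations above can be illustrated by tabulating the ratio $f(x)/g(x)$ along a sequence of points. This is a rough numerical sketch using only the Python standard library; the helper name `ratio` and the sample points are illustrative, and a finite table of values suggests but does not prove a limit.

```python
import math

def ratio(f, g, x):
    """Value of f(x)/g(x) at a single point x."""
    return f(x) / g(x)

small_x = [10.0 ** (-k) for k in range(1, 7)]   # points approaching 0
large_x = [10.0 ** k for k in range(1, 7)]      # points growing without bound

# x^2 = o(x) as x -> 0: the ratio x^2 / x shrinks toward 0.
little_o = [ratio(lambda x: x * x, lambda x: x, x) for x in small_x]

# sin x ~ x as x -> 0: the ratio approaches 1.
asym_sin = [ratio(math.sin, lambda x: x, x) for x in small_x]

# x^2 ~ x^2 + 3x + 1 as x -> infinity: the ratio approaches 1.
asym_poly = [ratio(lambda x: x * x, lambda x: x * x + 3 * x + 1, x)
             for x in large_x]

# sin x = O(|x|): |sin x| <= |x| at every sampled point (K = 1).
bounded = all(abs(math.sin(x)) <= abs(x)
              for x in [-50.0, -1.0, -1e-3, 1e-3, 1.0, 50.0])

print(little_o)
print(asym_sin)
print(asym_poly)
print(bounded)
```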


3.4. CONTINUOUS FUNCTIONS 


A function f(x) may have a limit L as $x \to a$. This limit, however, may not be
equal to the value of the function at $x = a$. In fact, the function may not even
be defined at this point. If f(x) is defined at $x = a$ and $L = f(a)$, then f(x) is
said to be continuous at $x = a$.


Definition 3.4.1. Let $f: D \to R$, where $D \subset R$, and let $a \in D$. Then f(x)
is continuous at $x = a$ if for every $\epsilon > 0$ there exists a $\delta > 0$ such that

$$|f(x) - f(a)| < \epsilon$$

for all $x \in D$ for which $|x - a| < \delta$.

It is important here to note that in order for f(x) to be continuous at
$x = a$, it is necessary that it be defined at $x = a$ as well as at all other points
inside a neighborhood $N_r(a)$ of the point a for some $r > 0$. Thus to show
continuity of f(x) at $x = a$, the following conditions must be verified:

1. f(x) is defined at all points inside a neighborhood of the point a.

2. f(x) has a limit from the left and a limit from the right as $x \to a$, and
these two limits are equal to L.

3. The value of f(x) at $x = a$ is equal to L.

For convenience, we shall denote the left-sided and right-sided limits of
f(x) as $x \to a$ by $f(a^-)$ and $f(a^+)$, respectively.

If any of the above conditions is violated, then f(x) is said to be
discontinuous at $x = a$. There are two kinds of discontinuity. □


Definition 3.4.2. A function $f: D \to R$ has a discontinuity of the first kind
at $x = a$ if $f(a^-)$ and $f(a^+)$ exist, but at least one of them is different from
f(a). The function f(x) has a discontinuity of the second kind at the same
point if at least one of $f(a^-)$ and $f(a^+)$ does not exist. □


Definition 3.4.3. A function $f: D \to R$ is continuous on $E \subset D$ if it is
continuous at every point of E.

For example, the function

$$f(x) = \begin{cases} \dfrac{x - 1}{x^2 - 1}, & x > 0,\ x \neq 1, \\[1ex] \dfrac{1}{2}, & x = 1, \end{cases}$$

is defined for all $x > 0$ and is continuous at $x = 1$. This is true because, as
was shown in Example 3.1.1,

$$\lim_{x \to 1} f(x) = \lim_{x \to 1} \frac{1}{x + 1} = \frac{1}{2},$$

which is equal to the value of the function at $x = 1$. Furthermore, f(x) is
continuous at all other points of its domain. Note that if f(1) were different
from $\frac{1}{2}$, then f(x) would have a discontinuity of the first kind at $x = 1$.
Let us now consider the function

$$f(x) = \begin{cases} x + 1, & x > 0, \\ 0, & x = 0, \\ x - 1, & x < 0. \end{cases}$$

This function is continuous everywhere except at $x = 0$, since it has no limit
as $x \to 0$ by the fact that $f(0^-) = -1$ and $f(0^+) = 1$. The discontinuity at this
point is therefore of the first kind.

An example of a discontinuity of the second kind is given by the function

$$f(x) = \begin{cases} \dfrac{1}{x}, & x \neq 0, \\[1ex] 0, & x = 0, \end{cases}$$

which has a discontinuity of the second kind at $x = 0$, since neither $f(0^-)$ nor
$f(0^+)$ exists.


Definition 3.4.4. A function $f: D \to R$ is left-continuous at $x = a$ if
$\lim_{x \to a^-} f(x) = f(a)$. It is right-continuous at $x = a$ if $\lim_{x \to a^+} f(x) = f(a)$. □

Obviously, a left-continuous or a right-continuous function is not necessarily
continuous. In order for f(x) to be continuous at $x = a$ it is necessary and
sufficient that f(x) be both left-continuous and right-continuous at this point.

For example, the function

$$f(x) = \begin{cases} x - 1, & x \le 0, \\ 1, & x > 0 \end{cases}$$

is left-continuous at $x = 0$, since $f(0^-) = -1 = f(0)$. If f(x) were defined so
that $f(x) = x - 1$ for $x < 0$ and $f(x) = 1$ for $x \ge 0$, then it would be
right-continuous at $x = 0$.


Definition 3.4.5. The function $f: D \to R$ is uniformly continuous on
$E \subset D$ if for every $\epsilon > 0$ there exists a $\delta > 0$ such that

$$|f(x_1) - f(x_2)| < \epsilon \tag{3.10}$$

for all $x_1, x_2 \in E$ for which $|x_1 - x_2| < \delta$. □

This definition appears to be identical to the definition of continuity. That
is not exactly the case. Uniform continuity is always associated with a set such
as E in Definition 3.4.5, whereas continuity can be defined at a single point
a. Furthermore, inequality (3.10) is true for all pairs of points $x_1, x_2 \in E$ such
that $|x_1 - x_2| < \delta$. Hence, $\delta$ depends only on $\epsilon$, not on the particular
locations of $x_1$, $x_2$. On the other hand, in the definition of continuity
(Definition 3.4.1) $\delta$ depends on $\epsilon$ as well as on the location of the point
where continuity is considered. In other words, $\delta$ can change from one point
to another for the same given $\epsilon > 0$. If, however, for a given $\epsilon > 0$, the same $\delta$
can be used with all points in some set $E \subset D$, then f(x) is uniformly
continuous on E. For this reason, whenever f(x) is uniformly continuous on
E, $\delta$ can be described as being "portable," which means it can be used
everywhere inside E provided that $\epsilon > 0$ remains unchanged.

Obviously, if f(x) is uniformly continuous on E, then it is continuous
there. The converse, however, is not true. For example, consider the function
$f: (0, 1) \to R$ given by $f(x) = 1/x$. Here, f(x) is continuous at all points of
$E = (0, 1)$, but is not uniformly continuous there. To demonstrate this fact, let
us first show that f(x) is continuous on E. Let $\epsilon > 0$ be given and let $a \in E$.
Since $a > 0$, there exists a $\delta_1 > 0$ such that the neighborhood $N_{\delta_1}(a)$ is a
subset of E. This can be accomplished by choosing $\delta_1$ such that $0 < \delta_1 < a$.
Now, for all $x \in N_{\delta_1}(a)$,

$$\left| \frac{1}{x} - \frac{1}{a} \right| = \frac{|x - a|}{ax} < \frac{\delta_1}{a(a - \delta_1)}.$$

Let $\delta_2 > 0$ be such that for the given $\epsilon > 0$,

$$\frac{\delta_2}{a(a - \delta_2)} < \epsilon,$$

which can be satisfied by requiring that

$$0 < \delta_2 < \frac{a^2 \epsilon}{1 + a\epsilon}.$$

Since

$$\frac{a^2 \epsilon}{1 + a\epsilon} < a,$$

then

$$\left| \frac{1}{x} - \frac{1}{a} \right| < \epsilon \tag{3.11}$$

if $|x - a| < \delta$, where

$$\delta < \min\left( a, \frac{a^2 \epsilon}{1 + a\epsilon} \right) = \frac{a^2 \epsilon}{1 + a\epsilon}.$$

It follows that $f(x) = 1/x$ is continuous at every point of E. We note here
the dependence of $\delta$ on both $\epsilon$ and a.
Let us now demonstrate that $f(x) = 1/x$ is not uniformly continuous on
E. Define G to be the set

$$G = \left\{ \frac{a^2 \epsilon}{1 + a\epsilon} \;\middle|\; a \in E \right\}.$$

In order for $f(x) = 1/x$ to be uniformly continuous on E, the infimum of
G must be positive. If this is possible, then for a given $\epsilon > 0$, (3.11) will be
satisfied by all x for which

$$|x - a| < \inf(G),$$

and for all $a \in (0, 1)$. However, this cannot happen, since $\inf(G) = 0$. Thus it
is not possible to find a single $\delta$ for which (3.11) will work for all $a \in (0, 1)$.

Let us now try another function defined on the same set $E = (0, 1)$,
namely, the function $f(x) = x^2$. In this case, for a given $\epsilon > 0$, let $\delta > 0$ be
such that

$$\delta^2 + 2\delta a - \epsilon < 0. \tag{3.12}$$

Then, for any $a \in E$, if $|x - a| < \delta$ we get

$$|x^2 - a^2| = |x - a|\,|x + a| = |x - a|\,|x - a + 2a| < \delta(\delta + 2a) < \epsilon. \tag{3.13}$$

It is easy to see that this inequality is satisfied by all $\delta > 0$ for which

$$0 < \delta < -a + \sqrt{a^2 + \epsilon}. \tag{3.14}$$

If H is the set

$$H = \{ -a + \sqrt{a^2 + \epsilon} \mid a \in E \},$$

then it can be verified that $\inf(H) = -1 + \sqrt{1 + \epsilon}$. Hence, by choosing $\delta$
such that

$$\delta \le \inf(H),$$

inequality (3.14), and hence (3.13), will be satisfied for all $a \in E$. The
function $f(x) = x^2$ is therefore uniformly continuous on E.

The above examples demonstrate that continuity and uniform continuity
on a set E are not always equivalent. They are, however, equivalent under
certain conditions on E. This will be illustrated in Theorem 3.4.6 in the next
subsection.
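
The contrast between the two examples can be seen numerically. The sketch below assumes NumPy; the grid over $(0, 1)$, the value $\epsilon = 0.1$, and the random pairs of points are illustrative choices. It evaluates the admissible bounds on $\delta$ for $1/x$ and for $x^2$ over the grid, then checks that a single $\delta$ just below $-1 + \sqrt{1 + \epsilon}$ works for $x^2$ at randomly chosen pairs of points.

```python
import numpy as np

eps = 0.1
a = np.linspace(1e-4, 1.0 - 1e-4, 100001)    # grid over E = (0, 1)

delta_inv = a**2 * eps / (1.0 + a * eps)     # bound on delta for f(x) = 1/x
delta_sq = -a + np.sqrt(a**2 + eps)          # bound on delta for f(x) = x^2

inf_G = delta_inv.min()    # keeps shrinking as the grid reaches closer to 0
inf_H = delta_sq.min()     # stays near -1 + sqrt(1 + eps)

# One "portable" delta for f(x) = x^2 on E.
delta = -1.0 + np.sqrt(1.0 + eps)
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1.0, 100000)
x2 = x1 + rng.uniform(-1.0, 1.0, 100000) * 0.999 * delta
inside = (x2 > 0.0) & (x2 < 1.0)             # keep pairs with both points in E
max_gap = np.max(np.abs(x1[inside] ** 2 - x2[inside] ** 2))

print("smallest bound for 1/x on the grid:", inf_G)
print("smallest bound for x^2 on the grid:", inf_H, "vs", delta)
print("largest |x1^2 - x2^2| with |x1 - x2| < delta:", max_gap)
```

The bound for $1/x$ collapses toward zero near the left end of the interval, while the bound for $x^2$ stays above a fixed positive value.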


3.4.1. Some Properties of Continuous Functions 


Continuous functions have some interesting properties, some of which are 
given in the following theorems: 


Theorem 3.4.1. Let f(x) and g(x) be two continuous functions defined
on a set $D \subset R$. Then:

1. $f(x) + g(x)$ and $f(x)g(x)$ are continuous on D.
2. $af(x)$ is continuous on D, where a is a constant.
3. $f(x)/g(x)$ is continuous on D provided that $g(x) \neq 0$ on D.


Proof. The proof is left as an exercise. □


Theorem 3.4.2. Suppose that f: D → R is continuous on D, and
g: f(D) → R is continuous on f(D), the image of D under f. Then the
composite function h: D → R defined as h(x) = g[f(x)] is continuous on D.


Proof. Let ε > 0 be given, and let a ∈ D. Since g is continuous at f(a),
there exists a δ' > 0 such that |g[f(x)] − g[f(a)]| < ε if |f(x) − f(a)| < δ'.
Since f(x) is continuous at x = a, there exists a δ > 0 such that |f(x) − f(a)|
< δ' if |x − a| < δ. It follows that by taking |x − a| < δ we must have
|h(x) − h(a)| < ε. □


Theorem 3.4.3. If f(x) is continuous at x = a and f(a) > 0, then there
exists a neighborhood N_δ(a) in which f(x) > 0.


Proof. Since f(x) is continuous at x = a, there exists a δ > 0 such that

$$|f(x) - f(a)| < \tfrac{1}{2} f(a),$$

if |x − a| < δ. This implies that

$$f(x) > \tfrac{1}{2} f(a) > 0$$

for all x ∈ N_δ(a). □


Theorem 3.4.4 (The Intermediate-Value Theorem). Let f: D → R be
continuous, and let [a, b] be a closed interval contained in D. Suppose that
f(a) < f(b). If λ is a number such that f(a) < λ < f(b), then there exists a
point c, where a < c < b, such that λ = f(c).


Proof. Let g: D → R be defined as g(x) = f(x) − λ. This function is
continuous and is such that g(a) < 0, g(b) > 0. Consider the set

$$S = \{x \in [a,b] \mid g(x) < 0\}.$$

This set is nonempty, since a ∈ S, and is bounded from above by b. Hence, by
Theorem 1.5.1 the least upper bound of S exists. Let c = lub(S). Since
S ⊂ [a, b], then c ∈ [a, b].

Now, for every positive integer n, there exists a point x_n ∈ S such that

$$c - \frac{1}{n} < x_n \le c.$$

Otherwise, if x ≤ c − 1/n for all x ∈ S, then c − 1/n will be an upper bound
of S, contrary to the definition of c. Consequently, lim_{n→∞} x_n = c. Since g(x)
is continuous on [a, b], then

$$g(c) = \lim_{n\to\infty} g(x_n) \le 0, \tag{3.15}$$

by Theorem 3.2.2 and the fact that g(x_n) < 0. From (3.15) we conclude that
c < b, since g(b) > 0.

Let us suppose that g(c) < 0. Then, by Theorem 3.4.3, there exists a
neighborhood N_δ(c), for some δ > 0, such that g(x) < 0 for all x ∈ N_δ(c) ∩
[a, b]. Consequently, there exists a point x_0 ∈ [a, b] such that c < x_0 < c + δ
and g(x_0) < 0. This means that x_0 belongs to S and is greater than c,
a contradiction. Therefore, by inequality (3.15) we must have g(c) = 0, that
is, f(c) = λ. We note that c > a, since c ≥ a, but c ≠ a. This last is true
because if a = c, then g(c) < 0, a contradiction. This completes the proof of
the theorem. □


The direct implication of the intermediate-value theorem is that a continu- 
ous function possesses the property of assuming at least once every value 
between any two distinct values taken inside its domain. 
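The intermediate-value theorem is what justifies root bracketing in numerical work. The sketch below uses bisection, which is a different device from the least-upper-bound argument in the proof: it repeatedly halves an interval on which g(x) = f(x) − λ changes sign. The function name `bisect_root`, the tolerance, and the sample function x³ + x are assumptions chosen for illustration.

```python
def bisect_root(g, a, b, tol=1e-12, max_iter=200):
    """Approximate a point c in (a, b) with g(c) = 0.

    Assumes g is continuous on [a, b] with g(a) < 0 < g(b), so that the
    intermediate-value theorem guarantees at least one such point.
    """
    if not (g(a) < 0.0 < g(b)):
        raise ValueError("need g(a) < 0 < g(b)")
    lo, hi = a, b
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid   # sign change now lies in [mid, hi]
        else:
            hi = mid   # sign change now lies in [lo, mid]
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

f = lambda x: x ** 3 + x      # continuous, f(0) = 0 < 1 < f(1) = 2
lam = 1.0                     # value between f(0) and f(1)
c = bisect_root(lambda x: f(x) - lam, 0.0, 1.0)
print(c, f(c))
```

If f takes the value λ more than once on [a, b], bisection returns only one of the solutions; the theorem itself promises no more than existence.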


Theorem 3.4.5. Suppose that f: D → R is continuous and that D is
closed and bounded. Then f(x) is bounded in D.


Proof. Let a be the greatest lower bound of D, which exists because D is
bounded. Since D is closed, then a ∈ D (why?). Furthermore, since f(x)
is continuous, then for a given ε > 0 there exists a δ_1 > 0 such that

$$f(a) - \epsilon < f(x) < f(a) + \epsilon$$

if |x − a| < δ_1. The function f(x) is therefore bounded in N_{δ_1}(a). Define 𝒜 to
be the set

$$\mathcal{A} = \{x \in D \mid f(x) \text{ is bounded}\}.$$

This set is nonempty and bounded, and N_{δ_1}(a) ∩ D ⊂ 𝒜. We need to show
that D − 𝒜 is an empty set.

As before, the least upper bound of 𝒜 exists (since it is bounded) and
belongs to D (since D is closed). Let c = lub(𝒜). By the continuity of f(x),
there exists a neighborhood N_{δ_2}(c) in which f(x) is bounded for some δ_2 > 0.
If D − 𝒜 is nonempty, then N_{δ_2}(c) ∩ (D − 𝒜) is also nonempty [if N_{δ_2}(c) ⊂ 𝒜,
then c + (δ_2/2) ∈ 𝒜, a contradiction]. Let x_0 ∈ N_{δ_2}(c) ∩ (D − 𝒜). Then, on
one hand, f(x_0) is not bounded, since x_0 ∈ (D − 𝒜). On the other hand,
f(x_0) must be bounded, since x_0 ∈ N_{δ_2}(c). This contradiction leads us to
conclude that D − 𝒜 must be empty and that f(x) is bounded in D. □


Corollary 3.4.1. If f: D → R is continuous, where D is closed and
bounded, then f(x) achieves its infimum and supremum at least once in D,
that is, there exist ξ, η ∈ D such that

$$f(\xi) \le f(x) \quad \text{for all } x \in D,$$
$$f(\eta) \ge f(x) \quad \text{for all } x \in D.$$

Equivalently,

$$f(\xi) = \inf_{x \in D} f(x), \qquad f(\eta) = \sup_{x \in D} f(x).$$


Proof. By Theorem 3.4.5, f(D) is a bounded set. Hence, its least upper
bound exists. Let M = lub f(D), which is the same as sup_{x∈D} f(x). If there
exists no point x in D for which f(x) = M, then M − f(x) > 0 for all x ∈ D.
Consequently, 1/[M − f(x)] is continuous on D by Theorem 3.4.1, and is
hence bounded there by Theorem 3.4.5.

Now, if δ > 0 is any given positive number, we can find a value x for which
f(x) > M − δ, or

$$\frac{1}{M - f(x)} > \frac{1}{\delta}.$$

This implies that 1/[M − f(x)] is not bounded, a contradiction. Therefore,
there must exist a point η ∈ D at which f(η) = M.
The proof concerning the existence of a point ξ ∈ D such that f(ξ) =
inf_{x∈D} f(x) is similar. □


The requirement that D be closed in Corollary 3.4.1 is essential. For
example, the function f(x) = 2x − 1, which is defined on D = {x | 0 < x < 1},
cannot achieve its infimum, namely −1, in D. For if there exists a ξ ∈ D such
that f(ξ) ≤ 2x − 1 for all x ∈ D, then there exists a δ > 0 such that 0 < ξ − δ.
Hence,

$$f(\xi - \delta) = 2\xi - 2\delta - 1 < f(\xi),$$

a contradiction.


Theorem 3.4.6. Let f: D → R be continuous on D. If D is closed and
bounded, then f is uniformly continuous on D.


Proof. Suppose that f is not uniformly continuous. Then, by using the
logical negation of the statement concerning uniform continuity in Definition
3.4.5, we may conclude that there exists an ε > 0 such that for every δ > 0, we
can find x_1, x_2 ∈ D with |x_1 − x_2| < δ for which |f(x_1) − f(x_2)| ≥ ε. On this
basis, by choosing δ = 1, we can find u_1, v_1 ∈ D with |u_1 − v_1| < 1 for which
|f(u_1) − f(v_1)| ≥ ε. Similarly, we can find u_2, v_2 ∈ D with |u_2 − v_2| < 1/2 for
which |f(u_2) − f(v_2)| ≥ ε. By continuing in this process we can find u_n, v_n ∈ D
with |u_n − v_n| < 1/n for which |f(u_n) − f(v_n)| ≥ ε, n = 3, 4, ....

Now, let S be the set

$$S = \{u_n \mid n = 1, 2, \ldots\}.$$

This set is bounded, since S ⊂ D. Hence, its least upper bound exists. Let
c = lub(S). Since D is closed, then c ∈ D. Thus, as in the proof of Theorem
3.4.4, we can find points u_{n_1}, u_{n_2}, ..., u_{n_k}, ... in S such that
lim_{k→∞} u_{n_k} = c.

Since f(x) is continuous, there exists a δ' > 0 such that

$$|f(x) - f(c)| < \frac{\epsilon}{2},$$

if |x − c| < δ' for any given ε > 0. Let us next choose k large enough such
that if n_k > N, where N is some large positive number, then

$$|u_{n_k} - c| < \frac{\delta'}{2} \quad \text{and} \quad \frac{1}{n_k} < \frac{\delta'}{2}. \tag{3.16}$$

Since |u_{n_k} − v_{n_k}| < 1/n_k, then

$$|v_{n_k} - c| \le |u_{n_k} - v_{n_k}| + |u_{n_k} - c| < \frac{1}{n_k} + \frac{\delta'}{2} < \delta' \tag{3.17}$$

for n_k > N. From (3.16) and (3.17) and the continuity of f(x) we conclude
that

$$|f(u_{n_k}) - f(c)| < \frac{\epsilon}{2} \quad \text{and} \quad |f(v_{n_k}) - f(c)| < \frac{\epsilon}{2}.$$

Thus,

$$|f(u_{n_k}) - f(v_{n_k})| \le |f(u_{n_k}) - f(c)| + |f(v_{n_k}) - f(c)| < \epsilon. \tag{3.18}$$

However, as was seen earlier,

$$|f(u_n) - f(v_n)| \ge \epsilon,$$

hence,

$$|f(u_{n_k}) - f(v_{n_k})| \ge \epsilon,$$

which contradicts (3.18). This leads us to assert that f(x) is uniformly
continuous on D. □


3.4.2. Lipschitz Continuous Functions 
Lipschitz continuity is a specialized form of uniform continuity. 
Definition 3.4.6. The function f: D → R is said to satisfy the Lipschitz
condition on a set E ⊂ D if there exist constants, K and α, where K > 0 and
0 < α ≤ 1, such that

$$|f(x_1) - f(x_2)| \le K |x_1 - x_2|^{\alpha}$$

for all x_1, x_2 ∈ E. □


Notationally, whenever f(x) satisfies the Lipschitz condition with con-
stants K and α on a set E, we say that it is Lip(K, α) on E. In this case,
f(x) is called a Lipschitz continuous function. It is easy to see that a Lipschitz
continuous function on E is also uniformly continuous there.

As an example of a Lipschitz continuous function, consider f(x) = √x,
where x ≥ 0. We claim that √x is Lip(1, 1/2) on its domain. To show this, we
first write

$$|\sqrt{x_1} - \sqrt{x_2}| \le \sqrt{x_1} + \sqrt{x_2}.$$

Hence,

$$|\sqrt{x_1} - \sqrt{x_2}|^{2} \le |\sqrt{x_1} - \sqrt{x_2}|\,(\sqrt{x_1} + \sqrt{x_2}) = |x_1 - x_2|.$$

Thus,

$$|\sqrt{x_1} - \sqrt{x_2}| \le |x_1 - x_2|^{1/2},$$

which proves our claim.
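The Lip(1, 1/2) claim for the square root can be spot-checked numerically. The sketch below is illustrative only: it draws hypothetical pairs of points with a seeded generator and computes the ratio |√x_1 − √x_2| / |x_1 − x_2|^α. As an aside not made in the text, it also evaluates the same ratio with α = 1 near the origin, where it grows without bound.

```python
import math
import random

def holder_ratio(x1, x2, alpha=0.5):
    """|sqrt(x1) - sqrt(x2)| / |x1 - x2|**alpha for x1 != x2."""
    return abs(math.sqrt(x1) - math.sqrt(x2)) / abs(x1 - x2) ** alpha

rng = random.Random(0)  # fixed seed so the check is reproducible
pairs = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(10000)]

# With alpha = 1/2 the ratio should never exceed K = 1.
worst_half = max(holder_ratio(x1, x2, 0.5) for x1, x2 in pairs if x1 != x2)

# With alpha = 1 and x2 = 0 the ratio equals 1/sqrt(x1), which blows up.
ratios_alpha1 = [holder_ratio(10.0 ** (-k), 0.0, 1.0) for k in range(1, 9)]

print("largest ratio with alpha = 1/2:", worst_half)
print("ratio with alpha = 1 at x1 = 1e-8:", ratios_alpha1[-1])
```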


3.5. INVERSE FUNCTIONS 


From Chapter 1 we recall that one of the basic characteristics of a function
y = f(x) is that two values of y are equal if they correspond to the same value
of x. If we were to reverse the roles of x and y so that two values of x are
equal whenever they correspond to the same value of y, then x becomes a
function of y. Such a function is called the inverse function of f and is
denoted by f⁻¹. We conclude that the inverse of f: D → R exists if and only
if f is one-to-one.


Definition 3.5.1. Let f: D → R. If there exists a function f⁻¹: f(D) → D
such that f⁻¹[f(x)] = x for all x ∈ D and f[f⁻¹(y)] = y for all y ∈ f(D),
then f⁻¹ is called the inverse function of f. □


Definition 3.5.2. Let f: D → R. Then, f is said to be monotone increas-
ing [decreasing] on D if whenever x_1, x_2 ∈ D are such that x_1 < x_2,
then f(x_1) ≤ f(x_2) [f(x_1) ≥ f(x_2)]. The function f is strictly monotone in-
creasing [decreasing] on D if f(x_1) < f(x_2) [f(x_1) > f(x_2)] whenever x_1 < x_2.
□


If f is either monotone increasing or monotone decreasing on D, then it is 
called a monotone function on D. In particular, if it is either strictly 
monotone increasing or strictly monotone decreasing, then f(x) is strictly 
monotone on D. 

Strictly monotone functions have the property that their inverse functions 
exist. This will be shown in the next theorem. 


Theorem 3.5.1. Let f: D → R be strictly monotone increasing (or de-
creasing) on D. Then, there exists a unique inverse function f⁻¹, which is
strictly monotone increasing (or decreasing) on f(D).


Proof. Let us suppose that f is strictly monotone increasing on D. To
show that f⁻¹ exists as a strictly monotone increasing function on f(D),
suppose that x_1, x_2 ∈ D are such that f(x_1) = f(x_2) = y. If x_1 ≠ x_2, then
x_1 < x_2 or x_2 < x_1. Since f is strictly monotone increasing, then f(x_1) < f(x_2)
or f(x_2) < f(x_1). In either case, f(x_1) ≠ f(x_2), which contradicts the assump-
tion that f(x_1) = f(x_2). Hence, x_1 = x_2, that is, f is one-to-one and has
therefore a unique inverse f⁻¹.

The inverse f⁻¹ is strictly monotone increasing on f(D). To show this,
suppose that f(x_1) < f(x_2). Then, x_1 < x_2. If not, we must have x_1 ≥ x_2. In
this case, f(x_1) = f(x_2) when x_1 = x_2, or f(x_1) > f(x_2) when x_1 > x_2, since f
is strictly monotone increasing. However, this is contrary to the assumption
that f(x_1) < f(x_2). Thus x_1 < x_2 and f⁻¹ is strictly monotone increasing.

The proof of Theorem 3.5.1 when “increasing” is replaced with “decreas-
ing” is similar. □


Theorem 3.5.2. Suppose that f: D → R is continuous and strictly mono-
tone increasing (decreasing) on [a, b] ⊂ D. Then, f⁻¹ is continuous and
strictly monotone increasing (decreasing) on f([a, b]).


Proof. By Theorem 3.5.1 we only need to show the continuity of f⁻¹.
Suppose that f is strictly monotone increasing. The proof when f is strictly
monotone decreasing is similar.

Since f is continuous on a closed and bounded interval, then by Corollary
3.4.1 it must achieve its infimum and supremum on [a, b]. Furthermore,
because f is strictly monotone increasing, its infimum and supremum must
be attained at only a and b, respectively. Thus

$$f([a,b]) = [f(a), f(b)].$$

Let d ∈ [f(a), f(b)]. There exists a unique value c, a ≤ c ≤ b, such that
f(c) = d. For any ε > 0, let τ be defined as

$$\tau = \min[\,f(c) - f(c - \epsilon),\ f(c + \epsilon) - f(c)\,].$$

Then there exists a δ, 0 < δ < τ, such that all the x's in [a, b] that satisfy the
inequality

$$|f(x) - d| < \delta$$

must also satisfy the inequality

$$|x - c| < \epsilon.$$

This is true because

$$f(c) - f(c) + f(c - \epsilon) < d - \delta < f(x) < d + \delta < f(c) + f(c + \epsilon) - f(c),$$

that is,

$$f(c - \epsilon) < f(x) < f(c + \epsilon).$$

Using the fact that f⁻¹ is strictly monotone increasing (by Theorem 3.5.1),
we conclude that

$$c - \epsilon < x < c + \epsilon,$$

that is, |x − c| < ε. It follows that x = f⁻¹(y) is continuous on [f(a), f(b)].
□


Note that in general if y=f(x), the equation y—f(x)=0 may not 
produce a unique solution for x in terms of y. If, however, the domain of f 
can be partitioned into subdomains on each of which f is strictly monotone, 
then f can have an inverse on each of these subdomains. 


EXAMPLE 3.5.1. Consider the function f: R → R defined by y = f(x) = x³.
It is easy to see that f is strictly monotone increasing for all x ∈ R. It
therefore has a unique inverse given by f⁻¹(y) = y^{1/3}.


EXAMPLE 3.5.2. Let f: [−1, 1] → R be such that y = f(x) = x⁵ − x. From
Figure 3.2 it can be seen that f is strictly monotone increasing on D_1 =
[−1, −5^{−1/4}] and D_2 = [5^{−1/4}, 1], but is strictly monotone decreasing on
D_3 = [−5^{−1/4}, 5^{−1/4}]. This function has therefore three inverses, one on
each of D_1, D_2, and D_3. By Theorem 3.5.2, all three inverses are continuous.


EXAMPLE 3.5.3. Let f: R → [−1, 1] be the function y = f(x) = sin x,
where x is measured in radians. There is no unique inverse on R, since the
sine function is not strictly monotone there. If, however, we restrict the
domain of f to D = [−π/2, π/2], then f is strictly monotone increasing
there and has the unique inverse f⁻¹(y) = Arcsin y (see Example 1.3.4). The
inverse of f on [π/2, 3π/2] is given by f⁻¹(y) = π − Arcsin y. We can
similarly find the inverse of f on [3π/2, 5π/2], [5π/2, 7π/2], etc.


Figure 3.2. The graph of the function f(x) = x⁵ − x.
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3.6. CONVEX FUNCTIONS 


Convex functions are frequently used in operations research. They also 
happen to be continuous, as will be shown in this section. The natural 
domains for such functions are convex sets. 


Definition 3.6.1. A set D ⊂ R is convex if λx_1 + (1 − λ)x_2 ∈ D whenever
x_1, x_2 belong to D and 0 ≤ λ ≤ 1. Geometrically, a convex set contains the
line segment connecting any two of its points. The same definition actually
applies to convex sets in R^n, the n-dimensional Euclidean space (n ≥ 2). For
example, each of the following sets is convex:


1. Any interval in R.
2. Any sphere in R³, and in general, any hypersphere in R^n, n ≥ 4.
3. The set {(x, y) ∈ R² | |x| + |y| ≤ 1}. See Figure 3.3. □


Definition 3.6.2. A function f: D → R is convex if

$$f[\lambda x_1 + (1 - \lambda) x_2] \le \lambda f(x_1) + (1 - \lambda) f(x_2) \tag{3.19}$$

for all x_1, x_2 ∈ D and any λ such that 0 ≤ λ ≤ 1. The function f is strictly
convex if inequality (3.19) is strict for x_1 ≠ x_2.

Geometrically, inequality (3.19) means that if P and Q are any two points
on the graph of y = f(x), then the portion of the graph between P and Q lies
below the chord PQ (see Figure 3.4). Examples of convex functions include
f(x) = x² on R, f(x) = sin x on [π, 2π], f(x) = e^x on R, f(x) = −log x for
x > 0, to name just a few. □


Definition 3.6.3. A function f: D → R is concave if −f is convex. □

We note that if f: [a, b] → R is convex and the values of f at a and b are
finite, then f(x) is bounded from above on [a, b] by M = max{f(a), f(b)}.
This is true because if x ∈ [a, b], then x = λa + (1 − λ)b for some λ ∈ [0, 1],
since [a, b] is a convex set. Hence,

$$f(x) \le \lambda f(a) + (1 - \lambda) f(b) \le \lambda M + (1 - \lambda) M = M.$$

Figure 3.3. The set {(x, y) ∈ R² | |x| + |y| ≤ 1}.

Figure 3.4. The graph of a convex function.

The function f(x) is also bounded from below. To show this, we first note
that any x ∈ [a, b] can be written as

$$x = \frac{a+b}{2} + t,$$

where

$$a - \frac{a+b}{2} \le t \le b - \frac{a+b}{2}.$$

Now,

$$f\left(\frac{a+b}{2}\right) \le \frac{1}{2} f\left(\frac{a+b}{2} + t\right) + \frac{1}{2} f\left(\frac{a+b}{2} - t\right), \tag{3.20}$$

since if (a + b)/2 + t belongs to [a, b], then so does (a + b)/2 − t. From
(3.20) we then have

$$f\left(\frac{a+b}{2} + t\right) \ge 2 f\left(\frac{a+b}{2}\right) - f\left(\frac{a+b}{2} - t\right).$$

Since

$$f\left(\frac{a+b}{2} - t\right) \le M,$$

then

$$f\left(\frac{a+b}{2} + t\right) \ge 2 f\left(\frac{a+b}{2}\right) - M,$$

that is, f(x) ≥ m for all x ∈ [a, b], where

$$m = 2 f\left(\frac{a+b}{2}\right) - M.$$


Another interesting property of convex functions is given in Theorem 
3.6.1. 


Theorem 3.6.1. Let f: D → R be a convex function, where D is an open
interval. Then f is Lip(K, 1) on any closed interval [a, b] contained in D,
that is,

$$|f(x_1) - f(x_2)| \le K |x_1 - x_2| \tag{3.21}$$

for all x_1, x_2 ∈ [a, b].


Proof. Consider the closed interval [a − ε, b + ε], where ε > 0 is chosen so
that this interval is contained in D. Let m' and M' be, respectively, the
lower and upper bounds of f (as was seen earlier) on [a − ε, b + ε]. Let
x_1, x_2 be any two distinct points in [a, b]. Define z_1 and λ as

$$z_1 = x_2 + \frac{\epsilon (x_2 - x_1)}{|x_1 - x_2|}, \qquad \lambda = \frac{|x_1 - x_2|}{\epsilon + |x_1 - x_2|}.$$

Then z_1 ∈ [a − ε, b + ε]. This is true because (x_2 − x_1)/|x_1 − x_2| is either
equal to 1 or to −1. Since x_2 ∈ [a, b], then

$$a - \epsilon \le x_2 - \epsilon \le x_2 + \frac{\epsilon (x_2 - x_1)}{|x_1 - x_2|} \le x_2 + \epsilon \le b + \epsilon.$$

Furthermore, it can be verified that

$$x_2 = \lambda z_1 + (1 - \lambda) x_1.$$

We then have

$$f(x_2) \le \lambda f(z_1) + (1 - \lambda) f(x_1) = \lambda [f(z_1) - f(x_1)] + f(x_1).$$

Thus,

$$f(x_2) - f(x_1) \le \lambda [f(z_1) - f(x_1)] \le \lambda [M' - m'] \le \frac{|x_1 - x_2|}{\epsilon} (M' - m') = K |x_1 - x_2|, \tag{3.22}$$

where K = (M' − m')/ε. Since inequality (3.22) is true for any x_1, x_2 ∈ [a, b],
we must also have

$$f(x_1) - f(x_2) \le K |x_1 - x_2|. \tag{3.23}$$

From inequalities (3.22) and (3.23) we conclude that

$$|f(x_1) - f(x_2)| \le K |x_1 - x_2|$$

for any x_1, x_2 ∈ [a, b], which shows that f(x) is Lip(K, 1) on [a, b]. □
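The constant K = (M' − m')/ε from the proof can be computed for a concrete convex function and compared with observed secant slopes. The sketch below is an illustration under assumptions: it uses f(x) = e^x on [a, b] = [0, 1] with ε = 0.5, and it estimates M' and m' by the largest and smallest values on a grid over [a − ε, b + ε]. For a monotone function such as e^x the grid endpoints give the exact bounds; for a general convex function the grid minimum is only an approximation of a lower bound.

```python
import math

def lipschitz_constant(f, a, b, eps, grid=2001):
    """K = (M' - m')/eps with M', m' taken over a grid on [a - eps, b + eps]."""
    lo, hi = a - eps, b + eps
    values = [f(lo + (hi - lo) * i / (grid - 1)) for i in range(grid)]
    return (max(values) - min(values)) / eps

def largest_slope(f, a, b, grid=201):
    """Largest |f(x1) - f(x2)| / |x1 - x2| over distinct grid points in [a, b]."""
    xs = [a + (b - a) * i / (grid - 1) for i in range(grid)]
    return max(abs(f(x1) - f(x2)) / abs(x1 - x2)
               for i, x1 in enumerate(xs) for x2 in xs[i + 1:])

K = lipschitz_constant(math.exp, 0.0, 1.0, eps=0.5)
slope = largest_slope(math.exp, 0.0, 1.0)
print("K from the proof      :", K)
print("largest observed slope:", slope)
```

The K delivered by the proof is valid but generally not the smallest Lipschitz constant; a smaller ε' or a direct bound on the slope can give a tighter value.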


Using Theorem 3.6.1 it is easy to prove the following corollary: 


Corollary 3.6.1. Let f: D → R be a convex function, where D is an open
interval. If [a, b] is any closed interval contained in D, then f(x) is uniformly
continuous on [a, b] and is therefore continuous on D.


Note that if f(x) is convex on (a, b), then it does not have to be
continuous at the end points of the interval. It is easy to see, for example,
that the function f: [−1, 1] → R defined as

$$f(x) = \begin{cases} x^{2}, & -1 < x < 1, \\ 2, & x = 1, -1 \end{cases}$$

is convex on (−1, 1), but is discontinuous at x = −1, 1.


3.7. CONTINUOUS AND CONVEX FUNCTIONS IN STATISTICS 


The most vivid examples of continuous functions in statistics are perhaps the
cumulative distribution functions of continuous random variables. If X is a
continuous random variable, then its cumulative distribution function

$$F(x) = P(X \le x)$$

is continuous on R. In this case,

$$P(X = a) = \lim_{n \to \infty} \left[ F\left(a + \frac{1}{n}\right) - F\left(a - \frac{1}{n}\right) \right] = 0,$$

that is, the distribution of X assigns a zero probability to any single value.
This is a basic characteristic of continuous random variables.

Examples of continuous distributions include the beta, Cauchy, chi-
squared, exponential, gamma, Laplace, logistic, lognormal, normal, t, uni-
form, and the Weibull distributions. Most of these distributions are described
in introductory statistics books. A detailed account of their properties and
uses is given in the two books by Johnson and Kotz (1970a, 1970b).

It is interesting to note that if X is any random variable (not necessarily
continuous), then its cumulative distribution function, F(x), is right-continu-
ous on R (see, for example, Harris, 1966, page 55). This function is also
monotone increasing on R. If F(x) is strictly monotone increasing, then by
Theorem 3.5.1 it has a unique inverse F⁻¹(y). In this case, if Y has the
uniform distribution over the open interval (0, 1), then the random variable
F⁻¹(Y) has the cumulative distribution function F(x). To show this, consider
X = F⁻¹(Y). Then,

$$P(X \le x) = P[F^{-1}(Y) \le x] = P[Y \le F(x)] = F(x).$$


This result has an interesting application in sampling. If Y_1, Y_2, ..., Y_n
form an independent random sample from the uniform distribution U(0, 1),
then F⁻¹(Y_1), F⁻¹(Y_2), ..., F⁻¹(Y_n) will form an independent sample from a
distribution with the cumulative distribution function F(x). In other words,
samples from any distribution can be generated through sampling from the 
uniform distribution U(0,1). This result forms the cornerstone of Monte 
Carlo simulation in statistics. Such a method provides an artificial way of 
collecting “data.” There are situations where the actual taking of a physical 
sample is either impossible or too expensive. In such situations, useful 
information can often be derived from simulated sampling. Monte Carlo 
simulation is also used in the study of the relative performance of test 
statistics and parameter estimators when the data come from certain speci- 
fied parent distributions. 
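The sampling recipe above can be sketched in a few lines. As an illustration, the code assumes the exponential distribution with cumulative distribution function F(x) = 1 − exp(−x/θ), whose inverse is F⁻¹(y) = −θ log(1 − y); the function names, the seed, and the sample size are hypothetical choices, and Python's pseudorandom generator stands in for a table of random numbers.

```python
import math
import random

def exponential_inverse_cdf(y, theta):
    """Solve F(x) = y for F(x) = 1 - exp(-x/theta), with 0 <= y < 1."""
    return -theta * math.log(1.0 - y)

def sample_exponential(n, theta, rng):
    """Draw n values by applying F^{-1} to uniform variates on [0, 1)."""
    return [exponential_inverse_cdf(rng.random(), theta) for _ in range(n)]

rng = random.Random(12345)          # fixed seed for reproducibility
draws = sample_exponential(100000, 2.0, rng)
sample_mean = sum(draws) / len(draws)
print("sample mean (theta = 2):", sample_mean)

# A single uniform value mapped through F^{-1}, then checked against F.
y = 0.8389611097
x = exponential_inverse_cdf(y, 2.0)
print(x, 1.0 - math.exp(-x / 2.0))
```

With a large number of draws the sample mean should settle near θ, in line with the mean of the exponential distribution.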

Another example of the use of continuous functions in statistics is in limit
theory. For example, it is known that if {X_n}_{n=1}^∞ is a sequence of random
variables that converges in probability to c, and if g(x) is a continuous
function at x = c, then the random variable g(X_n) converges in probability to
g(c) as n → ∞. By definition, a sequence of random variables {X_n}_{n=1}^∞
converges in probability to a constant c if for a given ε > 0,

$$\lim_{n \to \infty} P(|X_n - c| \ge \epsilon) = 0.$$

In particular, if {X_n}_{n=1}^∞ is a sequence of estimators of a parameter c, then
X_n is said to be a consistent estimator of c if X_n converges in probability to
c. For example, the sample mean

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$

of a sample of size n from a population with a finite mean μ is a consistent
estimator of μ according to the law of large numbers (see, for example,
Lindgren, 1976, page 155). Other types of convergence in statistics can be
found in standard mathematical statistics books (see the annotated bibliogra-
phy).

Convex functions also play an important role in statistics, as can be seen 
from the following examples. 

If f(x) is a convex function and X is a random variable with a finite mean
μ = E(X), then

$$E[f(X)] \ge f[E(X)].$$

This inequality is known as Jensen’s inequality. If f is strictly convex, the
inequality is strict unless X is constant with probability 1. A proof of
Jensen’s inequality is given in Section 6.7.4. See also Lehmann (1983, page 50).
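Jensen's inequality is easy to check for a discrete random variable. The sketch below is illustrative: the support points, the probabilities, and the list of convex functions are hypothetical choices, and the code simply compares E[f(X)] with f[E(X)] for each function.

```python
import math

def expectation(values, probs, g=lambda x: x):
    """E[g(X)] for a discrete X taking values[i] with probability probs[i]."""
    return sum(p * g(x) for x, p in zip(values, probs))

values = [0.5, 1.0, 2.0, 4.0]   # support of X (all positive)
probs = [0.1, 0.4, 0.3, 0.2]    # probabilities, summing to 1
mu = expectation(values, probs)

convex_functions = {
    "x^2": lambda x: x * x,
    "exp(x)": math.exp,
    "-log(x)": lambda x: -math.log(x),
    "1/x": lambda x: 1.0 / x,
}

# Jensen gap E[f(X)] - f(E[X]); it should be nonnegative for convex f.
gaps = {name: expectation(values, probs, f) - f(mu)
        for name, f in convex_functions.items()}
for name, gap in gaps.items():
    print(name, gap)
```

All four functions are strictly convex on the positive half-line and X is not constant, so each gap should come out strictly positive.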

Jensen’s inequality has useful applications in statistics. For example, it can
be used to show that if x_1, x_2, ..., x_n are n positive scalars, then their
arithmetic mean is greater than or equal to their geometric mean, which is
equal to (∏_{i=1}^n x_i)^{1/n}. This can be shown as follows:

Consider the convex function f(x) = −log x. Let X be a discrete random
variable that takes the values x_1, x_2, ..., x_n with probabilities equal to 1/n,
that is,

$$P(X = x) = \begin{cases} \dfrac{1}{n}, & x = x_1, x_2, \ldots, x_n, \\ 0 & \text{otherwise.} \end{cases}$$

Then, by Jensen’s inequality,

$$E(-\log X) \ge -\log E(X). \tag{3.24}$$

However,

$$E(-\log X) = -\frac{1}{n} \sum_{i=1}^{n} \log x_i, \tag{3.25}$$

and

$$-\log E(X) = -\log\left( \frac{1}{n} \sum_{i=1}^{n} x_i \right). \tag{3.26}$$

By using (3.25) and (3.26) in (3.24) we get

$$\frac{1}{n} \sum_{i=1}^{n} \log x_i \le \log\left( \frac{1}{n} \sum_{i=1}^{n} x_i \right),$$

or

$$\log\left( \prod_{i=1}^{n} x_i \right)^{1/n} \le \log\left( \frac{1}{n} \sum_{i=1}^{n} x_i \right).$$

Since the logarithmic function is monotone increasing, we conclude that

$$\left( \prod_{i=1}^{n} x_i \right)^{1/n} \le \frac{1}{n} \sum_{i=1}^{n} x_i.$$
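Jensen’s inequality can also be used to show that the arithmetic mean is
greater than or equal to the harmonic mean. This assertion can be shown as
follows:

Consider the function f(x) = x⁻¹, which is convex for x > 0. If X is a
random variable with P(X > 0) = 1, then by Jensen’s inequality,

$$E\left( \frac{1}{X} \right) \ge \frac{1}{E(X)}. \tag{3.27}$$

In particular, if X has the discrete distribution described earlier, then

$$E\left( \frac{1}{X} \right) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i}$$

and

$$E(X) = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

By substitution in (3.27) we get

$$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i} \ge \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^{-1},$$

or

$$\frac{1}{n} \sum_{i=1}^{n} x_i \ge \left( \frac{1}{n} \sum_{i=1}^{n} \frac{1}{x_i} \right)^{-1}. \tag{3.28}$$

The quantity on the right of inequality (3.28) is the harmonic mean of
x_1, x_2, ..., x_n.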
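The two mean inequalities derived from Jensen's inequality can be checked directly on a small data set. The numbers below are hypothetical, and the helper names are illustrative; the geometric mean is computed through logarithms, mirroring the use of f(x) = −log x in the derivation.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the average logarithm; requires all x > 0
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def harmonic_mean(xs):
    # reciprocal of the average reciprocal; requires all x > 0
    return len(xs) / sum(1.0 / x for x in xs)

xs = [1.0, 2.0, 4.0, 8.0]
am, gm, hm = arithmetic_mean(xs), geometric_mean(xs), harmonic_mean(xs)
print("arithmetic:", am)
print("geometric :", gm)
print("harmonic  :", hm)
```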
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Another example of the use of convex functions in statistics is in the 
general theory of estimation. Let X_1, X_2, ..., X_n be a random sample of size
n from a population whose distribution depends on an unknown parameter
θ. Let ω(X_1, X_2, ..., X_n) be an estimator of θ. By definition, the loss
function L[θ, ω(X_1, X_2, ..., X_n)] is a nonnegative function that measures the
loss incurred when θ is estimated by ω(X_1, X_2, ..., X_n). The expected value
(mean) of the loss function is called the risk function, denoted by R(θ, ω),
that is,

$$R(\theta, \omega) = E\{L[\theta, \omega(X_1, X_2, \ldots, X_n)]\}.$$

The loss function is taken to be a convex function of θ. Examples of loss
functions include the squared error loss,

$$L[\theta, \omega(X_1, X_2, \ldots, X_n)] = [\theta - \omega(X_1, X_2, \ldots, X_n)]^{2},$$

and the absolute error loss,

$$L[\theta, \omega(X_1, X_2, \ldots, X_n)] = |\theta - \omega(X_1, X_2, \ldots, X_n)|.$$

The first loss function is strictly convex, whereas the second is convex, but not
strictly convex.

The goodness of an estimator of θ is judged on the basis of its risk
function, assuming that a certain loss function has been selected. The smaller
the risk, the more desirable the estimator. An estimator ω*(X_1, X_2, ..., X_n)
is said to be admissible if there is no other estimator ω(X_1, X_2, ..., X_n) of θ
such that

$$R(\theta, \omega) \le R(\theta, \omega^{*})$$

for all θ ∈ Ω (Ω is the parameter space), with strict inequality for at least
one θ. An estimator ω_0(X_1, X_2, ..., X_n) is said to be a minimax estimator if it
minimizes the supremum (with respect to θ) of the risk function, that is,

$$\sup_{\theta \in \Omega} R(\theta, \omega_0) \le \sup_{\theta \in \Omega} R(\theta, \omega),$$

where ω(X_1, X_2, ..., X_n) is any other estimator of θ. It should be noted that
a minimax estimator may not be admissible.


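EXAMPLE 3.7.1. Let X_1, X_2, ..., X_20 be a random sample of size 20 from
the normal distribution N(θ, 1), −∞ < θ < ∞. Let ω_1(X_1, X_2, ..., X_20) = X̄_20
be the sample mean, and let ω_2(X_1, X_2, ..., X_20) = 0. Then, using a squared
error loss function,

$$R(\theta, \omega_1) = E[(\bar{X}_{20} - \theta)^{2}] = \mathrm{Var}(\bar{X}_{20}) = \frac{1}{20},$$
$$R(\theta, \omega_2) = E[(0 - \theta)^{2}] = \theta^{2}.$$

In this case,

$$\sup_{\theta} [R(\theta, \omega_1)] = \frac{1}{20},$$

whereas

$$\sup_{\theta} [R(\theta, \omega_2)] = \infty.$$

Thus ω_1 = X̄_20 is a better estimator than ω_2 = 0 in the minimax sense. It can
be shown that X̄_20 is the minimax estimator of θ with respect to a squared
error loss function. Note, however, that X̄_20 does not have the smaller risk
at every value of θ, since

$$R(\theta, \omega_1) \le R(\theta, \omega_2)$$

for θ ≥ 20^{−1/2} or θ ≤ −20^{−1/2}. However, for −20^{−1/2} < θ < 20^{−1/2},

$$R(\theta, \omega_2) < R(\theta, \omega_1).$$

Neither of the two estimators therefore dominates the other over the whole
parameter space.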
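The two risk functions in this example can be tabulated, and the risk of the sample mean can be checked by simulation. The sketch below is illustrative only: the function names, the seed, and the number of replications are assumptions, and the simulation uses the standard library's Gaussian generator.

```python
import random

N = 20  # sample size in the example

def risk_sample_mean(theta, n=N):
    # E[(Xbar - theta)^2] = Var(Xbar) = 1/n for N(theta, 1) data.
    return 1.0 / n

def risk_zero(theta):
    # E[(0 - theta)^2] = theta^2 for the constant estimator 0.
    return theta * theta

def simulated_risk_sample_mean(theta, reps=5000, n=N, seed=0):
    """Monte Carlo estimate of the squared error risk of the sample mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        total += (xbar - theta) ** 2
    return total / reps

threshold = N ** -0.5  # the risks are equal at |theta| = 20^(-1/2)
for theta in [0.0, 0.1, threshold, 0.5, 2.0]:
    print(theta, risk_sample_mean(theta), risk_zero(theta))
print("simulated risk at theta = 1:", simulated_risk_sample_mean(1.0))
```

The table shows the constant estimator with the smaller risk only for |θ| below 20^{−1/2}, while its risk grows without bound as |θ| increases.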
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EXERCISES 
In Mathematics 


3.1. Determine if the following limits exist: 


i xř-1 
(a) ot x—1 í 
b li . 1 

im x sin—, 
( ) x0 x 


(c) lim (sa) /(sa)}, 
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x?-1 
; x>0, 

s x-1 

(d) lim f(x), where f(x) = end 
ene. <0. 
Ga n 


3.2. Show that
(a) tan x³ = o(x²) as x → 0,
(b) x = o(√x) as x → 0,
(c) O(1) = o(x) as x → ∞,
(d) f(x)g(x) = x⁻¹ + O(1) as x → 0, where f(x) = x + o(x²), g(x) =
x⁻² + O(x⁻¹).


3.3. Determine where the following functions are continuous, and indicate 
the points of discontinuity (if any): 


xsin(1/x), x #0, 
a x)= 
(a) CORE = 
1/2 
(b) f(x) = (=a "BF, 
1, x=2, 
_fx7™/", x #0, 
(©) Bae ear 
where m and n are positive integers, 
xt —2x?+3 
(d) (OS a=an = ; x#1. 


3.4. Show that f(x) is continuous at x =a if and only if it is both left- 
continuous and right-continuous at x =a. 


3.5. Use Definition 3.4.1 to show that the function 


f(=) =x -1 


is continuous at any point a € R. 


3.6. For what values of x is 


continuous? 
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3.7. 


Consider the function

$$f(x) = \frac{x - |x|}{x}, \qquad -1 < x < 1, \quad x \ne 0.$$

Can f(x) be defined at x = 0 so that it will be continuous there?


3.8. Let f(x) be defined for all x ∈ R and continuous at x = 0. Further-
more,

$$f(a + b) = f(a) + f(b),$$

for all a, b in R. Show that f(x) is uniformly continuous everywhere
in R.


3.9. Let f(x) be defined as

$$f(x) = \begin{cases} 2x - 1, & 0 \le x \le 1, \\ x^{3} - 5x^{2} + 5, & 1 \le x \le 2. \end{cases}$$

Determine if f(x) is uniformly continuous on [0, 2].


3.10. Show that f(x) = cos x is uniformly continuous on R.


3.11. Prove Theorem 3.4.1.


3.12. A function f: D → R is called upper semicontinuous at a ∈ D if for a
given ε > 0 there exists a δ > 0 such that

$$f(x) < f(a) + \epsilon$$

for all x ∈ N_δ(a) ∩ D. If the above inequality is replaced with

$$f(x) > f(a) - \epsilon,$$

then f(x) is said to be lower semicontinuous.
Show that if D is closed and bounded, then
(a) f(x) is bounded from above on D if f(x) is upper semicontinuous.
(b) f(x) is bounded from below on D if f(x) is lower semicontinuous.


3.13. Let f: [a, b] → R be continuous such that f(x) = 0 for every rational
number in [a, b]. Show that f(x) = 0 for every x in [a, b].


3.14. For what values of x does the function

$$f(x) = 3 + |x - 1| + |x + 1|$$

have a unique inverse?
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3.15. 


Let f: R → R be defined as


Oe eer


x-1, x>1.


Find the inverse function f⁻¹.


3.16. Let f(x) = 2x² − 8x + 8. Find the inverse of f(x) for
(a) x < −2,
(b) x > 2.


3.17. Suppose that f: [a, b] → R is a convex function. Show that for a given
ε > 0 there exists a δ > 0 such that

$$\sum_{i=1}^{n} |f(a_i) - f(b_i)| < \epsilon$$

for every finite, pairwise disjoint family of open subintervals {(a_i, b_i)}_{i=1}^n
of [a, b] for which Σ_{i=1}^n (b_i − a_i) < δ.

Note: A function satisfying this property is said to be absolutely contin-
uous on [a, b].


3.18. Let f: [a, b] → R be a convex function. Show that if a_1, a_2, ..., a_n are
positive numbers and x_1, x_2, ..., x_n are points in [a, b], then

$$f\left( \frac{\sum_{i=1}^{n} a_i x_i}{A} \right) \le \frac{\sum_{i=1}^{n} a_i f(x_i)}{A},$$

where A = Σ_{i=1}^n a_i.


3.19. Let f(x) be continuous on D ⊂ R. Let S be the set of all x ∈ D such
that f(x) = 0. Show that S is a closed set.


3.20. Let f(x) be a convex function on D ⊂ R. Show that exp[f(x)] is also
convex on D.

In Statistics 


3.21. Let X be a continuous random variable with the cumulative distribu-
tion function

$$F(x) = 1 - e^{-x/\theta}, \qquad x > 0.$$

This is known as the exponential distribution. Its mean and variance
are μ = θ, σ² = θ², respectively. Generate a random sample of five
observations from an exponential distribution with mean 2.
[Hint: Select a ten-digit number from the table of random numbers, for
example, 8389611097. Divide it by 10^{10} to obtain the decimal number
0.8389611097. This number can be regarded as an observation from the
uniform distribution U(0, 1). Now, solve the equation F(x) =
0.8389611097. The resulting value of x is considered as an observation
from the prescribed exponential distribution. Repeat this process four
more times, each time selecting a new decimal number from the table
of random numbers.]


3.22. Verify Jensen’s inequality in each of the following two cases:
(a) X is normally distributed and f(x) = |x|.
(b) X has the exponential distribution and f(x) = e^{−x}.


3.23. Use the definition of convergence in probability to verify that if the
sequence of random variables {X_n}_{n=1}^∞ converges in probability to zero,
then so does the sequence {X_n²}_{n=1}^∞.


3.24. Show that

$$E(X^{2}) \ge [E(|X|)]^{2}.$$

[Hint: Let Y = |X|. Apply Jensen’s inequality to Y with f(x) = x².]
Deduce that if X has a mean μ and a variance σ², then

$$E(|X - \mu|) \le \sigma.$$


3.25. Consider the exponential distribution described in Exercise 3.21. Let
X_1, X_2, ..., X_n be a sample of size n from this distribution. Consider
the following estimators of θ:

(a) ω_1(X_1, X_2, ..., X_n) = X̄_n, the sample mean.

(b) ω_2(X_1, X_2, ..., X_n) = X̄_n + 1,

(c) ω_3(X_1, X_2, ..., X_n) = X_n.

Determine the risk function corresponding to a squared error loss
function for each one of these estimators. Which estimator has the
smallest risk for all values of θ?

CHAPTER 4 


Differentiation 


Differentiation originated in connection with the problems of drawing tan- 
gents to curves and of finding maxima and minima of functions. Pierre 
de Fermat (1601-1665), the founder of the modern theory of numbers, is 
credited with having put forth the main ideas on which differential calculus 
is based. 

In this chapter, we shall introduce the notion of differentiation and study 
its applications in the determination of maxima and minima of functions. We 
shall restrict our attention to real-valued functions defined on R, the set of 
real numbers. The study of differentiation in connection with multivariable 
functions, that is, functions defined on R” (n> 1), will be considered in 
Chapter 7. 


4.1. THE DERIVATIVE OF A FUNCTION 


The notion of differentiation was motivated by the need to find the tangent 
to a curve at a given point. Fermat’s approach to this problem was inspired by 
a geometric reasoning. His method uses the idea of a tangent as the limiting 
position of a secant when two of its points of intersection with the curve tend 
to coincide. This has led to the modern notation associated with the
derivative of a function, which we now introduce.


Definition 4.1.1. Let f(x) be a function defined in a neighborhood $N_r(x_0)$ of a point $x_0$. Consider the ratio

\[ \phi(h) = \frac{f(x_0 + h) - f(x_0)}{h}, \tag{4.1} \]

where h is a nonzero increment of $x_0$ such that $-r < h < r$. If $\phi(h)$ has a limit as $h \to 0$, then this limit is called the derivative of f(x) at $x_0$ and is denoted by $f'(x_0)$. It is also common to use the notation

\[ \left. \frac{df(x)}{dx} \right|_{x = x_0} = f'(x_0). \]


We thus have

\[ f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}. \tag{4.2} \]

By putting $x = x_0 + h$, formula (4.2) can be written as

\[ f'(x_0) = \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0}. \]

If $f'(x_0)$ exists, then f(x) is said to be differentiable at $x = x_0$. Geometrically, $f'(x_0)$ is the slope of the tangent to the graph of the function y = f(x) at the point $(x_0, y_0)$, where $y_0 = f(x_0)$. If f(x) has a derivative at every point of a set D, then f(x) is said to be differentiable on D.

It is important to note that in order for $f'(x_0)$ to exist, the left-sided and right-sided limits of $\phi(h)$ in formula (4.1) must exist and be equal as $h \to 0$, or as x approaches $x_0$ from either side. It is possible to consider only one-sided derivatives at $x = x_0$. These occur when $\phi(h)$ has just a one-sided limit as $h \to 0$. We shall not, however, concern ourselves with such derivatives in this chapter. □
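As a rough numerical illustration of Definition 4.1.1, the following Python sketch evaluates the ratio $[f(x_0 + h) - f(x_0)]/h$ for shrinking increments h of both signs. The function $f(x) = x^2$, the point $x_0 = 3$, and the name `difference_quotient` are illustrative assumptions and do not come from the text.

```python
def difference_quotient(f, x0, h):
    """Return [f(x0 + h) - f(x0)] / h for a nonzero increment h."""
    if h == 0:
        raise ValueError("h must be nonzero")
    return (f(x0 + h) - f(x0)) / h


def f(x):
    # Illustrative choice; its derivative at x0 = 3 is 6.
    return x ** 2


# Ratios for h = +/-10^-1, ..., +/-10^-6: both one-sided sequences
# should settle near the same value if the derivative exists.
quotients = [
    (sign * 10.0 ** (-k), difference_quotient(f, 3.0, sign * 10.0 ** (-k)))
    for k in range(1, 7)
    for sign in (1, -1)
]

for h, q in quotients:
    print(f"h = {h:+.0e}  ratio = {q:.8f}")
```

For this particular f the ratio equals 6 + h exactly (up to rounding), so the right-sided and left-sided values approach 6 from above and below, respectively.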


Functions that are differentiable at a point must necessarily be continuous 
there. This will be shown in the next theorem. 


Theorem 4.1.1. Let f(x) be defined in a neighborhood of a point $x_0$. If f(x) has a derivative at $x_0$, then it must be continuous at $x_0$.


Proof. From Definition 4.1.1 we can write

\[ f(x_0 + h) - f(x_0) = h\,\phi(h). \]

If the derivative of f(x) exists at $x_0$, then $\phi(h) \to f'(x_0)$ as $h \to 0$. It follows from Theorem 3.2.1(2) that

\[ f(x_0 + h) - f(x_0) \to 0 \]

as $h \to 0$. Thus for a given $\epsilon > 0$ there exists a $\delta > 0$ such that

\[ |f(x_0 + h) - f(x_0)| < \epsilon \]


if $|h| < \delta$. This indicates that f(x) is continuous at $x_0$. □

It should be noted that even though continuity is a necessary condition for differentiability, it is not a sufficient condition, as can be seen from the following example: Let f(x) be defined as

\[ f(x) = \begin{cases} x \sin\dfrac{1}{x}, & x \ne 0, \\ 0, & x = 0. \end{cases} \]

This function is continuous at x = 0, since $f(0) = \lim_{x \to 0} f(x) = 0$ by the fact that

\[ \left| x \sin\frac{1}{x} \right| \le |x| \]

for all $x \ne 0$. However, f(x) is not differentiable at x = 0. This is because when $x_0 = 0$,

\[
\begin{aligned}
\phi(h) &= \frac{f(h) - f(0)}{h} \\
&= \frac{h \sin(1/h) - 0}{h}, \quad \text{since } h \ne 0, \\
&= \sin\frac{1}{h},
\end{aligned}
\]

which does not have a limit as $h \to 0$. Hence, $f'(0)$ does not exist.

If f(x) is differentiable on a set D, then f’(x) is a function defined on D. 
In the event f'(x) itself is differentiable on D, then its derivative is called the 
second derivative of f(x) and is denoted by f”(x). It is also common to use 
the notation 


\[ \frac{d^2 f(x)}{dx^2} = f''(x). \]

By the same token, we can define the nth ($n \ge 2$) derivative of f(x) as the derivative of the (n − 1)st derivative of f(x). We denote this derivative by

\[ \frac{d^n f(x)}{dx^n} = f^{(n)}(x), \quad n = 2, 3, \ldots. \]


We shall now discuss some rules that pertain to differentiation. The 
reader is expected to know how to differentiate certain elementary functions 
such as polynomial, exponential, and trigonometric functions.

Theorem 4.1.2. Let f(x) and g(x) be defined and differentiable on a set D. Then

1. $[\alpha f(x) + \beta g(x)]' = \alpha f'(x) + \beta g'(x)$, where $\alpha$ and $\beta$ are constants.
2. $[f(x) g(x)]' = f'(x) g(x) + f(x) g'(x)$.
3. $[f(x)/g(x)]' = [f'(x) g(x) - f(x) g'(x)]/g^2(x)$ if $g(x) \ne 0$.


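Proof. The proof of (1) is straightforward. To prove (2) we write

\[
\begin{aligned}
\lim_{h \to 0} \frac{f(x+h) g(x+h) - f(x) g(x)}{h}
&= \lim_{h \to 0} \frac{g(x+h)[f(x+h) - f(x)] + f(x)[g(x+h) - g(x)]}{h} \\
&= \lim_{h \to 0} g(x+h) \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} + f(x) \lim_{h \to 0} \frac{g(x+h) - g(x)}{h}.
\end{aligned}
\]

However, $\lim_{h \to 0} g(x+h) = g(x)$, since g(x) is continuous (because it is differentiable). Hence,

\[ \lim_{h \to 0} \frac{f(x+h) g(x+h) - f(x) g(x)}{h} = g(x) f'(x) + f(x) g'(x). \]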


Now, to prove (3) we write

\[
\begin{aligned}
\lim_{h \to 0} \frac{f(x+h)/g(x+h) - f(x)/g(x)}{h}
&= \lim_{h \to 0} \frac{g(x) f(x+h) - f(x) g(x+h)}{h\, g(x) g(x+h)} \\
&= \lim_{h \to 0} \frac{g(x)[f(x+h) - f(x)] - f(x)[g(x+h) - g(x)]}{h\, g(x) g(x+h)} \\
&= \frac{\lim_{h \to 0} \{ g(x)[f(x+h) - f(x)]/h - f(x)[g(x+h) - g(x)]/h \}}{g(x) \lim_{h \to 0} g(x+h)} \\
&= \frac{g(x) f'(x) - f(x) g'(x)}{g^2(x)}. \qquad \square
\end{aligned}
\]

Theorem 4.1.3 (The Chain Rule). Let $f: D_1 \to R$ and $g: D_2 \to R$ be two functions. Suppose that $f(D_1) \subset D_2$. If f(x) is differentiable on $D_1$ and g(x) is differentiable on $D_2$, then the composite function h(x) = g[f(x)] is differentiable on $D_1$, and

\[ \frac{dg[f(x)]}{dx} = \frac{dg[f(x)]}{df(x)} \, \frac{df(x)}{dx}. \]

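Proof. Let z = f(x) and t = f(x + h). By the fact that g(z) is differentiable we can write

\[
\begin{aligned}
g[f(x+h)] - g[f(x)] &= g(t) - g(z) \\
&= (t - z) g'(z) + o(t - z), \tag{4.3}
\end{aligned}
\]

where, if we recall, the o-notation was introduced in Section 3.3. We then have

\[ \frac{g[f(x+h)] - g[f(x)]}{h} = \frac{t - z}{h}\, g'(z) + \frac{o(t - z)}{t - z} \, \frac{t - z}{h}. \tag{4.4} \]

As $h \to 0$, $t \to z$, and hence

\[ \lim_{h \to 0} \frac{t - z}{h} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} = \frac{df(x)}{dx}. \]

Now, by taking the limits of both sides of (4.4) as $h \to 0$ and noting that

\[ \lim_{h \to 0} \frac{o(t - z)}{t - z} = \lim_{t \to z} \frac{o(t - z)}{t - z} = 0, \]

we conclude that

\[ \frac{dg[f(x)]}{dx} = \frac{df(x)}{dx} \, \frac{dg[f(x)]}{df(x)}. \qquad \square \]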

Note 4.1.1. We recall that f(x) must be continuous in order for f'(x) to exist. However, if f'(x) exists, it does not have to be continuous. Care should be exercised when showing that f'(x) is continuous. For example, let us consider the function

\[ f(x) = \begin{cases} x^2 \sin\dfrac{1}{x}, & x \ne 0, \\ 0, & x = 0. \end{cases} \]


Suppose that it is desired to show that f'(x) exists, and if so, to determine if it is continuous. To do so, let us first find out if f'(x) exists at x = 0:

\[
\begin{aligned}
f'(0) &= \lim_{h \to 0} \frac{f(h) - f(0)}{h} \\
&= \lim_{h \to 0} \frac{h^2 \sin(1/h)}{h} \\
&= \lim_{h \to 0} h \sin\frac{1}{h} = 0.
\end{aligned}
\]

Thus the derivative of f(x) exists at x = 0 and is equal to zero. For $x \ne 0$, it is clear that the derivative of f(x) exists. By applying Theorem 4.1.2 and using our knowledge of the derivatives of elementary functions, f'(x) can be written as

\[ f'(x) = \begin{cases} 2x \sin\dfrac{1}{x} - \cos\dfrac{1}{x}, & x \ne 0, \\ 0, & x = 0. \end{cases} \]

We note that f'(x) exists for all x, but is not continuous at x = 0, since

\[ \lim_{x \to 0} f'(x) = \lim_{x \to 0} \left[ 2x \sin\frac{1}{x} - \cos\frac{1}{x} \right] \]

does not exist, because cos(1/x) has no limit as $x \to 0$. However, for any nonzero value of x, f'(x) is continuous.
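The failure of f'(x) to have a limit at the origin can be seen numerically. The sketch below evaluates the derivative of $x^2 \sin(1/x)$ along the sequence $x_k = 1/(k\pi)$, which tends to zero; the choice of this sequence and the helper name `f_prime` are assumptions made only for illustration.

```python
import math


def f_prime(x):
    """Derivative of f(x) = x^2 sin(1/x) for x != 0, with f'(0) = 0."""
    if x == 0:
        return 0.0
    return 2 * x * math.sin(1 / x) - math.cos(1 / x)


# Along x_k = 1/(k*pi) the term 2x sin(1/x) is negligible and
# -cos(1/x) = -cos(k*pi) alternates between -1 (k even) and +1 (k odd).
samples = [(k, f_prime(1 / (k * math.pi))) for k in range(1000, 1006)]

for k, value in samples:
    print(f"k = {k}  x_k = {1 / (k * math.pi):.6e}  f'(x_k) = {value:+.6f}")
```

The values keep alternating in sign no matter how close $x_k$ is to zero, which is consistent with the nonexistence of the limit of f'(x) at x = 0.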

If f(x) is a convex function, then we have the following interesting result, 
whose proof can be found in Roberts and Varberg (1973, Theorem C, 
page 7): 


Theorem 4.1.4. If $f: (a, b) \to R$ is convex on the open interval (a, b), then the set S where f'(x) fails to exist is either finite or countable. Moreover, f'(x) is continuous on (a, b) − S, the complement of S with respect to (a, b).


For example, the function f(x) = |x| is convex on R. Its derivative does not exist at x = 0 (why?), but is continuous everywhere else.

The sign of f'(x) provides information about the behavior of f(x) in a 
neighborhood of x. More specifically, we have the following theorem: 


Theorem 4.1.5. Let $f: D \to R$, where D is an open set. Suppose that f'(x) is positive at a point $x_0 \in D$. Then there is a neighborhood $N_\delta(x_0) \subset D$ such that for each x in this neighborhood, $f(x) > f(x_0)$ if $x > x_0$, and $f(x) < f(x_0)$ if $x < x_0$.

Proof. Let $\epsilon = f'(x_0)/2$. Then, there exists a $\delta > 0$ such that

\[ f'(x_0) - \epsilon < \frac{f(x) - f(x_0)}{x - x_0} < f'(x_0) + \epsilon \]

if $|x - x_0| < \delta$. Hence, if $x > x_0$,

\[ f(x) - f(x_0) > \frac{(x - x_0) f'(x_0)}{2}, \]

which shows that $f(x) > f(x_0)$ since $f'(x_0) > 0$. Furthermore, since

\[ \frac{f(x) - f(x_0)}{x - x_0} > 0, \]

then $f(x) < f(x_0)$ if $x < x_0$. □


If $f'(x_0) < 0$, it can be similarly shown that $f(x) < f(x_0)$ if $x > x_0$ and $f(x) > f(x_0)$ if $x < x_0$.


4.2. THE MEAN VALUE THEOREM 


This is one of the most important theorems in differential calculus. It is also 
known as the theorem of the mean. Before proving the mean value theorem, 
let us prove a special case of it known as Rolle’s theorem. 


Theorem 4.2.1 (Rolle’s Theorem). Let f(x) be continuous on the closed 
interval [a,b] and differentiable on the open interval (a, b). If f(a) = f(b), 
then there exists a point c, a <c <b, such that f’(c)=0. 


Proof. Let d denote the common value of f(a) and f(b). Define h(x) = f(x) − d. Then h(a) = h(b) = 0. If h(x) is also zero on (a, b), then h'(x) = 0 for a < x < b and the theorem is proved. Let us therefore assume that $h(x) \ne 0$ for some $x \in (a, b)$. Since h(x) is continuous on [a, b] [because f(x) is], then by Corollary 3.4.1 it must achieve its supremum M at a point $\xi$ in [a, b], and its infimum m at a point $\eta$ in [a, b]. If h(x) > 0 for some $x \in (a, b)$, then we must obviously have $a < \xi < b$, because h(x) vanishes at both end points. We now claim that $h'(\xi) = 0$. If $h'(\xi) > 0$ or $< 0$, then by Theorem 4.1.5, there exists a point $x_1$ in a neighborhood $N_{\delta_1}(\xi) \subset (a, b)$ at which $h(x_1) > h(\xi)$, a contradiction, since $h(\xi) = M$. Thus $h'(\xi) = 0$, which implies that $f'(\xi) = 0$, since h'(x) = f'(x) for all $x \in (a, b)$. We can similarly arrive at the conclusion that $f'(\eta) = 0$ if h(x) < 0 for some $x \in (a, b)$. In this case, if $h'(\eta) \ne 0$, then by Theorem 4.1.5 there exists a point $x_2$ in a neighborhood $N_{\delta_2}(\eta) \subset (a, b)$ at which $h(x_2) < h(\eta) = m$, a contradiction, since m is the infimum of h(x) over [a, b].

Thus in both cases, whether h(x) > 0 or < 0 for some $x \in (a, b)$, we must have a point c, a < c < b, such that f'(c) = 0. □


Rolle’s theorem has the following geometric interpretation: If f(x) satis- 
fies the conditions of Theorem 4.2.1, then the graph of y = f(x) must have a 
tangent line that is parallel to the x-axis at some point c between a and b. 
Note that there can be several points like c inside (a, b). For example, the 
function $y = x^3 - 5x^2 + 3x - 1$ satisfies the conditions of Rolle’s theorem on the interval [a, b], where a = 0 and $b = (5 + \sqrt{13})/2$. In this case, f(a) = f(b) = −1, and $f'(x) = 3x^2 - 10x + 3$ vanishes at $x = \frac{1}{3}$ and x = 3.
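The arithmetic in this example is easy to check by machine. The following Python sketch evaluates the cubic at both end points and solves the quadratic equation for the stationary points; the variable names are illustrative only.

```python
import math


def f(x):
    return x ** 3 - 5 * x ** 2 + 3 * x - 1


def f_prime(x):
    return 3 * x ** 2 - 10 * x + 3


a = 0.0
b = (5 + math.sqrt(13)) / 2

# Roots of 3x^2 - 10x + 3 = 0 from the quadratic formula.
disc = math.sqrt(10 ** 2 - 4 * 3 * 3)
roots = sorted([(10 - disc) / 6, (10 + disc) / 6])

print("f(a) =", f(a), " f(b) =", f(b))
print("stationary points:", roots)
print("all inside (a, b):", all(a < r < b for r in roots))
```

Both roots fall strictly between a and b, so this function has two points c of the kind guaranteed by Rolle's theorem.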


Theorem 4.2.2 (The Mean Value Theorem). If f(x) is continuous on the 
closed interval [a, b] and differentiable on the open interval (a, b), then there 
exists a point c, a <c <b, such that 


f(b) =f(a) + (b-a)f'(c). 
Proof. Consider the function 


\[ \Phi(x) = f(x) - f(a) - A(x - a), \]

where

\[ A = \frac{f(b) - f(a)}{b - a}. \]

The function $\Phi(x)$ is continuous on [a, b] and is differentiable on (a, b), since $\Phi'(x) = f'(x) - A$. Furthermore, $\Phi(a) = \Phi(b) = 0$. It follows from Rolle’s theorem that there exists a point c, a < c < b, such that $\Phi'(c) = 0$. Thus,

\[ f'(c) = \frac{f(b) - f(a)}{b - a}, \]

which proves the theorem. □


The mean value theorem also has a nice geometric interpretation. If the graph of the function y = f(x) has a tangent line at each point of its length between two points $P_1$ and $P_2$ (see Figure 4.1), then there must be a point Q between $P_1$ and $P_2$ at which the tangent line is parallel to the secant line through $P_1$ and $P_2$. Note that there can be several points on the curve between $P_1$ and $P_2$ that have the same property as Q, as can be seen from Figure 4.1.
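To make the mean value theorem concrete, the sketch below locates a point c with f'(c) equal to the slope of the secant line. It uses bisection, which is a numerical technique chosen here for illustration and is not part of the theorem; it assumes that f'(c) minus the secant slope changes sign between the two end points, which holds for the illustrative choice $f(x) = x^3$ on [0, 2] but not for every function.

```python
def mean_value_point(f, f_prime, a, b, tol=1e-12):
    """Find c in (a, b) with f'(c) = [f(b) - f(a)] / (b - a) by bisection.

    Assumes f'(c) - slope has opposite signs at a and b (an extra
    assumption needed by bisection, not by the theorem itself).
    """
    slope = (f(b) - f(a)) / (b - a)

    def g(c):
        return f_prime(c) - slope

    lo, hi = a, b
    if g(lo) * g(hi) > 0:
        raise ValueError("bisection needs a sign change of f'(c) - slope")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# Illustrative example: f(x) = x^3 on [0, 2]; the secant slope is 4,
# so 3c^2 = 4 and c = 2 / sqrt(3).
c = mean_value_point(lambda x: x ** 3, lambda x: 3 * x ** 2, 0.0, 2.0)
print("c =", c)
```

Here the tangent at c is parallel to the secant through (0, 0) and (2, 8), as in the geometric interpretation.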

The mean value theorem is useful in the derivation of several interesting results, as will be seen in the remainder of this chapter.

Figure 4.1. Tangent lines parallel to the secant line.


Corollary 4.2.1. If f(x) has a derivative f'(x) that is nonnegative (non- 
positive) on an interval (a, b), then f(x) is monotone increasing (decreasing) 
on (a,b). If f'(x) is positive (negative) on (a,b), then f(x) is strictly 
monotone increasing (decreasing) there. 


Proof. Let $x_1$ and $x_2$ be two points in (a, b) such that $x_1 < x_2$. By the mean value theorem, there exists a point $x_0$, $x_1 < x_0 < x_2$, such that

\[ f(x_2) = f(x_1) + (x_2 - x_1) f'(x_0). \]

If $f'(x_0) \ge 0$, then $f(x_2) \ge f(x_1)$ and f(x) is monotone increasing. Similarly, if $f'(x_0) \le 0$, then $f(x_2) \le f(x_1)$ and f(x) is monotone decreasing. If, however, f'(x) > 0, or f'(x) < 0 on (a, b), then strict monotonicity follows over (a, b). □


Theorem 4.2.3. If f(x) is monotone increasing [decreasing] on an interval (a, b), and if f(x) is differentiable there, then $f'(x) \ge 0$ [$f'(x) \le 0$] on (a, b).


Proof. Let $x_0 \in (a, b)$. There exists a neighborhood $N_r(x_0) \subset (a, b)$. Then, for any $x \in N_r(x_0)$ such that $x \ne x_0$, the ratio

\[ \frac{f(x) - f(x_0)}{x - x_0} \]

is nonnegative. This is true because $f(x) \ge f(x_0)$ if $x > x_0$ and $f(x) \le f(x_0)$ if $x < x_0$. By taking the limit of this ratio as $x \to x_0$ we claim that $f'(x_0) \ge 0$. To prove this claim, suppose that $f'(x_0) < 0$. Then there exists a $\delta > 0$ such that

\[ \left| \frac{f(x) - f(x_0)}{x - x_0} - f'(x_0) \right| < -\frac{1}{2} f'(x_0) \]

if $|x - x_0| < \delta$. It follows that

\[ \frac{f(x) - f(x_0)}{x - x_0} < \frac{1}{2} f'(x_0) < 0. \]

Thus $f(x) < f(x_0)$ if $x > x_0$, which is a contradiction. Hence, $f'(x_0) \ge 0$. A similar argument can be used to show that $f'(x_0) \le 0$ when f(x) is monotone decreasing. □


Note that strict monotonicity on a set D does not necessarily imply that f'(x) > 0, or f'(x) < 0, for all x in D. For example, the function $f(x) = x^3$ is strictly monotone increasing for all x, but f'(0) = 0.

We recall from Theorem 3.5.1 that strict monotonicity of f(x) is a sufficient condition for the existence of the inverse function $f^{-1}$. The next theorem shows that under certain conditions, $f^{-1}$ is a differentiable function.


Theorem 4.2.4. Suppose that f(x) is strictly monotone increasing (or decreasing) and continuous on an interval [a, b]. If f'(x) exists and is different from zero at $x_0 \in (a, b)$, then the inverse function $f^{-1}$ is differentiable at $y_0 = f(x_0)$ and its derivative is given by

\[ \left. \frac{df^{-1}(y)}{dy} \right|_{y = y_0} = \frac{1}{f'(x_0)}. \]

Proof. By Theorem 3.5.2, $f^{-1}(y)$ exists and is continuous. Let $x_0 \in (a, b)$, and let $N_r(x_0) \subset (a, b)$ for some r > 0. Then, for any $x \in N_r(x_0)$,

\[
\begin{aligned}
\frac{f^{-1}(y) - f^{-1}(y_0)}{y - y_0} &= \frac{x - x_0}{f(x) - f(x_0)} \\
&= \frac{1}{[f(x) - f(x_0)]/(x - x_0)}, \tag{4.5}
\end{aligned}
\]

where y = f(x). Now, since both f and $f^{-1}$ are continuous, then $x \to x_0$ if and only if $y \to y_0$. By taking the limits of all sides in formula (4.5), we conclude that the derivative of $f^{-1}$ at $y_0$ exists and is equal to

\[ \left. \frac{df^{-1}(y)}{dy} \right|_{y = y_0} = \frac{1}{f'(x_0)}. \qquad \square \]


The following theorem gives a more general version of the mean value theorem. It is due to Augustin-Louis Cauchy and has an important application in calculating certain limits, as will be seen later in this chapter.

Theorem 4.2.5 (Cauchy’s Mean Value Theorem). If f(x) and g(x) are continuous on the closed interval [a, b] and differentiable on the open interval (a, b), then there exists a point c, a < c < b, such that

\[ [f(b) - f(a)]\, g'(c) = [g(b) - g(a)]\, f'(c). \]

Proof. The proof is based on using Rolle’s theorem in a manner similar to that of Theorem 4.2.2. Define the function $\psi(x)$ as

\[ \psi(x) = [f(b) - f(x)][g(b) - g(a)] - [f(b) - f(a)][g(b) - g(x)]. \]

This function is continuous on [a, b] and is differentiable on (a, b), since

\[ \psi'(x) = -f'(x)[g(b) - g(a)] + g'(x)[f(b) - f(a)]. \]

Furthermore, $\psi(a) = \psi(b) = 0$. Thus by Rolle’s theorem, there exists a point c, a < c < b, such that $\psi'(c) = 0$, that is,

\[ -f'(c)[g(b) - g(a)] + g'(c)[f(b) - f(a)] = 0. \tag{4.6} \]

In particular, if $g(b) - g(a) \ne 0$ and f'(x) and g'(x) do not vanish at the same point in (a, b), then formula (4.6) can be written as

\[ \frac{f'(c)}{g'(c)} = \frac{f(b) - f(a)}{g(b) - g(a)}. \qquad \square \]


An immediate application of this theorem is a very popular method in calculating the limits of certain ratios of functions. This method is attributed to Guillaume François Marquis de l’Hospital (1661-1704) and is known as l’Hospital’s rule. It deals with the limit of the ratio f(x)/g(x) as $x \to a$ when both the numerator and the denominator tend simultaneously to zero or to infinity as $x \to a$. In either case, we have what is called an indeterminate ratio caused by having 0/0 or $\infty/\infty$ as $x \to a$.


Theorem 4.2.6 (l’Hospital’s Rule). Let f(x) and g(x) be continuous on the closed interval [a, b] and differentiable on the open interval (a, b). Suppose that we have the following:

1. g(x) and g'(x) are not zero at any point inside (a, b).
2. $\lim_{x \to a^+} f'(x)/g'(x)$ exists.
3. $f(x) \to 0$ and $g(x) \to 0$ as $x \to a^+$, or $f(x) \to \infty$ and $g(x) \to \infty$ as $x \to a^+$.

Then,

\[ \lim_{x \to a^+} \frac{f(x)}{g(x)} = \lim_{x \to a^+} \frac{f'(x)}{g'(x)}. \]

Proof. For the sake of simplicity, we shall drop the + sign from $a^+$ and simply write $x \to a$ when x approaches a from the right. Let us consider the following cases:


Case 1. $f(x) \to 0$ and $g(x) \to 0$ as $x \to a$, where a is finite. Let $x \in (a, b)$. By applying Cauchy’s mean value theorem on the interval [a, x] we get

\[ \frac{f(x)}{g(x)} = \frac{f(x) - f(a)}{g(x) - g(a)} = \frac{f'(c)}{g'(c)}, \]

where a < c < x. Note that f(a) = g(a) = 0, since f(x) and g(x) are continuous and their limits are equal to zero when $x \to a$. Now, as $x \to a$, $c \to a$; hence

\[ \lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{c \to a} \frac{f'(c)}{g'(c)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}. \]


Case 2. $f(x) \to 0$ and $g(x) \to 0$ as $x \to \infty$. Let z = 1/x. As $x \to \infty$, $z \to 0$. Then

\[ \lim_{x \to \infty} \frac{f(x)}{g(x)} = \lim_{z \to 0} \frac{f_1(z)}{g_1(z)}, \tag{4.7} \]

where $f_1(z) = f(1/z)$ and $g_1(z) = g(1/z)$. These functions are continuous since f(x) and g(x) are, and $z \ne 0$ as $z \to 0$ (see Theorem 3.4.2). Here, we find it necessary to set $f_1(0) = g_1(0) = 0$ so that $f_1(z)$ and $g_1(z)$ will be continuous at z = 0, since their limits are equal to zero. This is equivalent to defining f(x) and g(x) to be zero at infinity in the extended real number system. Furthermore, by the chain rule of Theorem 4.1.3 we have

\[ f_1'(z) = -\frac{1}{z^2} f'\!\left(\frac{1}{z}\right), \qquad g_1'(z) = -\frac{1}{z^2} g'\!\left(\frac{1}{z}\right). \]

If we now apply Case 1, we get

\[
\begin{aligned}
\lim_{z \to 0} \frac{f_1(z)}{g_1(z)} &= \lim_{z \to 0} \frac{f_1'(z)}{g_1'(z)} \\
&= \lim_{x \to \infty} \frac{f'(x)}{g'(x)}. \tag{4.8}
\end{aligned}
\]

From (4.7) and (4.8) we then conclude that

\[ \lim_{x \to \infty} \frac{f(x)}{g(x)} = \lim_{x \to \infty} \frac{f'(x)}{g'(x)}. \]


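Case 3. $f(x) \to \infty$ and $g(x) \to \infty$ as $x \to a$, where a is finite. Let $\lim_{x \to a} [f'(x)/g'(x)] = L$. Then for a given $\epsilon > 0$ there exists a $\delta > 0$ such that $a + \delta < b$ and

\[ \left| \frac{f'(x)}{g'(x)} - L \right| < \epsilon \tag{4.9} \]

if $a < x < a + \delta$. By applying Cauchy’s mean value theorem on the interval $[x, a + \delta]$ we get

\[ \frac{f(x) - f(a + \delta)}{g(x) - g(a + \delta)} = \frac{f'(d)}{g'(d)}, \]

where $x < d < a + \delta$. From inequality (4.9) we then have

\[ \left| \frac{f(x) - f(a + \delta)}{g(x) - g(a + \delta)} - L \right| < \epsilon \]

for all x such that $a < x < a + \delta$. It follows that

\[
\begin{aligned}
L &= \lim_{x \to a} \frac{f(x) - f(a + \delta)}{g(x) - g(a + \delta)} \\
&= \lim_{x \to a} \frac{f(x)}{g(x)} \, \lim_{x \to a} \frac{1 - f(a + \delta)/f(x)}{1 - g(a + \delta)/g(x)} \\
&= \lim_{x \to a} \frac{f(x)}{g(x)},
\end{aligned}
\]

since both f(x) and g(x) tend to $\infty$ as $x \to a$.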


Case 4. $f(x) \to \infty$ and $g(x) \to \infty$ as $x \to \infty$. This can be easily shown by using the techniques applied in Cases 2 and 3.


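Case 5. $\lim_{x \to a} f'(x)/g'(x) = \infty$, where a is finite or infinite. Let us consider the ratio g(x)/f(x). We have that

\[ \lim_{x \to a} \frac{g'(x)}{f'(x)} = 0. \]

Hence,

\[ \lim_{x \to a} \frac{g(x)}{f(x)} = \lim_{x \to a} \frac{g'(x)}{f'(x)} = 0. \]

If A is any positive number, then there exists a $\delta > 0$ such that

\[ \frac{g(x)}{f(x)} < \frac{1}{A} \]

if $a < x < a + \delta$. Thus for such values of x,

\[ \frac{f(x)}{g(x)} > A, \]

which implies that

\[ \lim_{x \to a} \frac{f(x)}{g(x)} = \infty. \qquad \square \]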

When applying l’Hospital’s rule, the ratio f'(x)/g'(x) may assume the indeterminate form 0/0 or $\infty/\infty$ as $x \to a$. In this case, higher-order derivatives of f(x) and g(x), assuming such derivatives exist, will be needed. In general, if the first n − 1 ($n \ge 1$) derivatives of f(x) and g(x) tend simultaneously to zero or to $\infty$ as $x \to a$, and if the nth derivatives, $f^{(n)}(x)$ and $g^{(n)}(x)$, exist and satisfy the same conditions as those imposed on f'(x) and g'(x) in Theorem 4.2.6, then

\[ \lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f^{(n)}(x)}{g^{(n)}(x)}. \]


A Historical Note 


According to Eves (1976, page 342), in 1696 the Marquis de l’Hospital assembled the lecture notes of his teacher Johann Bernoulli (1667-1748) into the world’s first textbook on calculus. In this book, the so-called l’Hospital’s rule is found. It is perhaps more accurate to refer to this rule as the Bernoulli-l’Hospital rule. Note that the name l’Hospital follows the old French spelling and the letter s is not to be pronounced. In modern French this name is spelled as l’Hôpital.


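EXAMPLE 4.2.1.

\[ \lim_{x \to 0} \frac{\sin x}{x} = \lim_{x \to 0} \frac{\cos x}{1} = 1. \]

This is a well-known limit. It implies that sin x and x are asymptotically equal, that is, $\sin x \sim x$ as $x \to 0$ (see Section 3.3).


EXAMPLE 4.2.2.

\[ \lim_{x \to 0} \frac{1 - \cos x}{x^2} = \lim_{x \to 0} \frac{\sin x}{2x} = \lim_{x \to 0} \frac{\cos x}{2} = \frac{1}{2}. \]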


We note here that l’Hospital’s rule was applied twice before reaching the limit $\frac{1}{2}$.

EXAMPLE 4.2.3. $\lim_{x \to \infty} a^x/x$, where a > 1.

This is of the form $\infty/\infty$ as $x \to \infty$. Since $a^x = e^{x \log a}$, then

\[ \lim_{x \to \infty} \frac{a^x}{x} = \lim_{x \to \infty} \frac{e^{x \log a} (\log a)}{1} = \infty. \]

This is also a well-known limit. On the basis of this result it can be shown that (see Exercise 4.12) the following hold:

1. $\lim_{x \to \infty} \dfrac{a^x}{x^m} = \infty$, where a > 1, m > 0.

2. $\lim_{x \to \infty} \dfrac{\log x}{x^m} = 0$, where m > 0.

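EXAMPLE 4.2.4. $\lim_{x \to 0^+} x^x$.

This is of the form $0^0$ as $x \to 0^+$, which is indeterminate. It can be reduced to the form 0/0 or $\infty/\infty$ so that l’Hospital’s rule can apply. To do so we write $x^x$ as

\[ x^x = e^{x \log x}. \]

However,

\[ x \log x = \frac{\log x}{1/x}, \]

which is of the form $-\infty/\infty$ as $x \to 0^+$. By l’Hospital’s rule we then have

\[
\begin{aligned}
\lim_{x \to 0^+} (x \log x) &= \lim_{x \to 0^+} \frac{1/x}{-1/x^2} \\
&= \lim_{x \to 0^+} (-x) \\
&= 0.
\end{aligned}
\]

It follows that

\[
\begin{aligned}
\lim_{x \to 0^+} x^x &= \lim_{x \to 0^+} e^{x \log x} \\
&= \exp\left[ \lim_{x \to 0^+} (x \log x) \right] \\
&= 1.
\end{aligned}
\]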


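EXAMPLE 4.2.5. $\lim_{x \to \infty} x \log\left( \dfrac{x+1}{x-1} \right)$.

This is of the form $\infty \times 0$ as $x \to \infty$, which is indeterminate. But

\[ x \log\left( \frac{x+1}{x-1} \right) = \frac{\log\left( \dfrac{x+1}{x-1} \right)}{1/x} \]

is of the form 0/0 as $x \to \infty$. Hence,

\[
\begin{aligned}
\lim_{x \to \infty} x \log\left( \frac{x+1}{x-1} \right) &= \lim_{x \to \infty} \frac{\dfrac{-2}{(x-1)(x+1)}}{-\dfrac{1}{x^2}} \\
&= \lim_{x \to \infty} \frac{2}{(1 - 1/x)(1 + 1/x)} \\
&= 2.
\end{aligned}
\]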


We can see from the foregoing examples that the use of l’Hospital’s rule can facilitate the process of finding the limit of the ratio f(x)/g(x) as $x \to a$. In many cases, it is easier to work with f'(x)/g'(x) than with the original ratio. Many other indeterminate forms can also be resolved by l’Hospital’s rule by first reducing them to the form 0/0 or $\infty/\infty$ as was shown in Examples 4.2.4 and 4.2.5.
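The limits obtained in Examples 4.2.4 and 4.2.5 can be checked informally by direct evaluation. The Python sketch below tabulates $x^x$ for x near zero from the right and $x \log[(x+1)/(x-1)]$ for large x; the sample points are arbitrary, and a numerical table of this kind only suggests a limit and does not prove it.

```python
import math


def x_to_the_x(x):
    """x^x written as exp(x log x), valid for x > 0."""
    return math.exp(x * math.log(x))


def x_log_ratio(x):
    """x * log((x + 1) / (x - 1)), valid for x > 1."""
    return x * math.log((x + 1) / (x - 1))


near_zero = [x_to_the_x(10.0 ** (-k)) for k in (2, 4, 6, 8)]
large_x = [x_log_ratio(10.0 ** k) for k in (1, 2, 3, 4)]

print("x^x as x -> 0+ :", near_zero)
print("x log((x+1)/(x-1)) as x -> infinity :", large_x)
```

The first list drifts toward 1 and the second toward 2, in agreement with the two examples.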

It is important here to remember that the application of l’Hospital’s rule requires that the limit of f'(x)/g'(x) exist as a finite number or be equal to infinity in the extended real number system as $x \to a$. If this is not the case, then it does not necessarily follow that the limit of f(x)/g(x) does not exist. For example, consider $f(x) = x^2 \sin(1/x)$ and g(x) = x. Here, f(x)/g(x) tends to zero as $x \to 0$, as was seen earlier in this chapter. However, the ratio

\[ \frac{f'(x)}{g'(x)} = 2x \sin\frac{1}{x} - \cos\frac{1}{x} \]

has no limit as $x \to 0$, since it oscillates inside a small neighborhood of the origin.


4.3. TAYLOR’S THEOREM 


This theorem is also known as the general mean value theorem, since it is 
considered as an extension of the mean value theorem. It was formulated by 
the English mathematician Brook Taylor (1685-1731) in 1712 and has since 
become a very important theorem in calculus. Taylor used his theorem to 
expand functions into infinite series. However, full recognition of the impor- 
tance of Taylor’s expansion was not realized until 1755, when Leonhard 
Euler (1707-1783) applied it in his differential calculus, and still later, when 
Joseph Louis Lagrange (1736-1813) used it as the foundation of his theory of 
functions. 


Theorem 4.3.1 (Taylor’s Theorem). If the (n − 1)st ($n \ge 1$) derivative of f(x), namely $f^{(n-1)}(x)$, is continuous on the closed interval [a, b] and the nth derivative $f^{(n)}(x)$ exists on the open interval (a, b), then for each $x \in [a, b]$ we have

\[
f(x) = f(a) + (x - a) f'(a) + \frac{(x - a)^2}{2!} f''(a) + \cdots + \frac{(x - a)^{n-1}}{(n-1)!} f^{(n-1)}(a) + \frac{(x - a)^n}{n!} f^{(n)}(\xi),
\]

where $a < \xi < x$.


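Proof. The method to prove this theorem is very similar to the one used for Theorem 4.2.2. For a fixed x in [a, b] let the function $\phi_n(t)$ be defined as

\[ \phi_n(t) = g_n(t) - \left( \frac{x - t}{x - a} \right)^n g_n(a), \]

where $a \le t \le b$ and

\[
g_n(t) = f(x) - f(t) - (x - t) f'(t) - \frac{(x - t)^2}{2!} f''(t) - \cdots - \frac{(x - t)^{n-1}}{(n-1)!} f^{(n-1)}(t). \tag{4.10}
\]

The function $\phi_n(t)$ has the following properties:

1. $\phi_n(a) = 0$ and $\phi_n(x) = 0$.
2. $\phi_n(t)$ is a continuous function of t on [a, b].
3. The derivative of $\phi_n(t)$ with respect to t exists on (a, b). This derivative is equal to

\[
\begin{aligned}
\phi_n'(t) &= g_n'(t) + \frac{n (x - t)^{n-1}}{(x - a)^n} g_n(a) \\
&= -f'(t) + f'(t) - (x - t) f''(t) + (x - t) f''(t) - \cdots \\
&\quad + \frac{(x - t)^{n-2}}{(n-2)!} f^{(n-1)}(t) - \frac{(x - t)^{n-1}}{(n-1)!} f^{(n)}(t) + \frac{n (x - t)^{n-1}}{(x - a)^n} g_n(a) \\
&= -\frac{(x - t)^{n-1}}{(n-1)!} f^{(n)}(t) + \frac{n (x - t)^{n-1}}{(x - a)^n} g_n(a).
\end{aligned}
\]

By applying Rolle’s theorem to $\phi_n(t)$ on the interval [a, x] we can assert that there exists a value $\xi$, $a < \xi < x$, such that $\phi_n'(\xi) = 0$, that is,

\[ -\frac{(x - \xi)^{n-1}}{(n-1)!} f^{(n)}(\xi) + \frac{n (x - \xi)^{n-1}}{(x - a)^n} g_n(a) = 0, \]

or

\[ g_n(a) = \frac{(x - a)^n}{n!} f^{(n)}(\xi). \tag{4.11} \]

Using formula (4.10) in (4.11), we finally get

\[
f(x) = f(a) + (x - a) f'(a) + \frac{(x - a)^2}{2!} f''(a) + \cdots + \frac{(x - a)^{n-1}}{(n-1)!} f^{(n-1)}(a) + \frac{(x - a)^n}{n!} f^{(n)}(\xi). \tag{4.12}
\]

This is known as Taylor’s formula. It can also be expressed as

\[
f(a + h) = f(a) + h f'(a) + \frac{h^2}{2!} f''(a) + \cdots + \frac{h^{n-1}}{(n-1)!} f^{(n-1)}(a) + \frac{h^n}{n!} f^{(n)}(a + \theta_n h), \tag{4.13}
\]

where h = x − a and $0 < \theta_n < 1$. □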


In particular, if f(x) has derivatives of all orders in some neighborhood $N_r(a)$ of the point a, formula (4.12) can provide a series expansion of f(x) for $x \in N_r(a)$ as $n \to \infty$. The last term in formula (4.12), or formula (4.13), is called the remainder of Taylor’s series and is denoted by $R_n$. Thus,

\[
\begin{aligned}
R_n &= \frac{(x - a)^n}{n!} f^{(n)}(\xi) \\
&= \frac{h^n}{n!} f^{(n)}(a + \theta_n h).
\end{aligned}
\]

If $R_n \to 0$ as $n \to \infty$, then

\[ f(x) = f(a) + \sum_{n=1}^{\infty} \frac{(x - a)^n}{n!} f^{(n)}(a), \tag{4.14} \]

or

\[ f(a + h) = f(a) + \sum_{n=1}^{\infty} \frac{h^n}{n!} f^{(n)}(a). \tag{4.15} \]

This results in what is known as Taylor’s series. Thus the validity of Taylor’s series is contingent on having $R_n \to 0$ as $n \to \infty$, and on having derivatives of all orders for f(x) in $N_r(a)$. The existence of these derivatives alone is not sufficient to guarantee a valid expansion.

A special form of Taylor’s series is Maclaurin’s series, which results when a = 0. Formula (4.14) then reduces to

\[ f(x) = f(0) + \sum_{n=1}^{\infty} \frac{x^n}{n!} f^{(n)}(0). \tag{4.16} \]

In this case, the remainder takes the form

\[ R_n = \frac{x^n}{n!} f^{(n)}(\theta_n x). \tag{4.17} \]


The sum of the first n terms in Maclaurin’s series provides an approximation for the value of f(x). The size of the remainder determines how close the sum is to f(x). Since the remainder depends on $\theta_n$, which lies in the interval (0, 1), an upper bound on $R_n$ that is free of $\theta_n$ will therefore be needed to assess the accuracy of the approximation. For example, let us consider the function f(x) = cos x. In this case,

\[ f^{(n)}(x) = \cos\left( x + \frac{n\pi}{2} \right), \quad n = 1, 2, \ldots, \]

and

\[
f^{(n)}(0) = \cos\left( \frac{n\pi}{2} \right) =
\begin{cases} 0, & n \text{ odd}, \\ (-1)^{n/2}, & n \text{ even}. \end{cases}
\]

Formula (4.16) becomes

\[
\begin{aligned}
\cos x &= 1 + \sum_{n=1}^{\infty} (-1)^n \frac{x^{2n}}{(2n)!} \\
&= 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots + (-1)^n \frac{x^{2n}}{(2n)!} + R_{2n+1},
\end{aligned}
\]

where from formula (4.17) $R_{2n+1}$ is

\[ R_{2n+1} = \frac{x^{2n+1}}{(2n+1)!} \cos\left[ \theta_{2n+1} x + \frac{(2n+1)\pi}{2} \right]. \]

An upper bound on $|R_{2n+1}|$ is then given by

\[ |R_{2n+1}| \le \frac{|x|^{2n+1}}{(2n+1)!}. \]

Therefore, the error of approximating cos x with the sum

\[ s_{2n} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots + (-1)^n \frac{x^{2n}}{(2n)!} \]

does not exceed $|x|^{2n+1}/(2n+1)!$, where x is measured in radians. For example, if $x = \pi/3$ and n = 3, the sum

\[ s_6 = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} = 0.49996 \]

approximates $\cos(\pi/3)$ with an error not exceeding

\[ \frac{|x|^7}{7!} \approx 0.00027. \]

The true value of $\cos(\pi/3)$ is 0.5.
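The numbers in this example can be reproduced with a few lines of Python. The sketch below computes the partial sum of Maclaurin's series for cos x up to the term in $x^{2n}$ together with the remainder bound $|x|^{2n+1}/(2n+1)!$; the function names are illustrative.

```python
import math


def cos_partial_sum(x, n):
    """Sum of (-1)^k x^(2k) / (2k)! for k = 0, 1, ..., n."""
    return sum(
        (-1) ** k * x ** (2 * k) / math.factorial(2 * k) for k in range(n + 1)
    )


def remainder_bound(x, n):
    """Upper bound |x|^(2n+1) / (2n+1)! on the remainder."""
    return abs(x) ** (2 * n + 1) / math.factorial(2 * n + 1)


x = math.pi / 3
s6 = cos_partial_sum(x, 3)
bound = remainder_bound(x, 3)
error = abs(math.cos(x) - s6)

print(f"s_6   = {s6:.5f}")
print(f"bound = {bound:.5f}")
print(f"actual error = {error:.2e}")
```

The actual error is well below the bound, which is expected because the bound replaces the cosine factor in the remainder by its largest possible absolute value.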


4.4. MAXIMA AND MINIMA OF A FUNCTION 


In this section we consider the problem of finding the extreme values of a function y = f(x) whose derivative f'(x) exists in any open set inside its domain of definition.

Definition 4.4.1. A function $f: D \to R$ has a local (or relative) maximum at a point $x_0 \in D$ if there exists a $\delta > 0$ such that $f(x) \le f(x_0)$ for all $x \in N_\delta(x_0) \cap D$. The function f has a local (or relative) minimum at $x_0$ if $f(x) \ge f(x_0)$ for all $x \in N_\delta(x_0) \cap D$. Local maxima and minima are referred to as local optima (or local extrema). □


Definition 4.4.2. A function $f: D \to R$ has an absolute maximum (minimum) over D if there exists a point $x^* \in D$ such that $f(x) \le f(x^*)$ [$f(x) \ge f(x^*)$] for all $x \in D$. Absolute maxima and minima are called absolute optima (or extrema). □


The determination of local optima of f(x) can be greatly facilitated if f(x) 
is differentiable. 


Theorem 4.4.1. Let f(x) be differentiable on the open interval (a, b). If f(x) has a local maximum, or a local minimum, at a point $x_0$ in (a, b), then $f'(x_0) = 0$.


Proof. Suppose that f(x) has a local maximum at $x_0$. Then, $f(x) \le f(x_0)$ for all x in some neighborhood $N_\delta(x_0) \subset (a, b)$. It follows that

\[
\frac{f(x) - f(x_0)}{x - x_0}
\begin{cases} \le 0 & \text{if } x > x_0, \\ \ge 0 & \text{if } x < x_0, \end{cases} \tag{4.18}
\]

for all x in $N_\delta(x_0)$. As $x \to x_0^+$, the ratio in (4.18) will have a nonpositive limit, and if $x \to x_0^-$, the ratio will have a nonnegative limit. Since $f'(x_0)$ exists, these two limits must be equal and equal to $f'(x_0)$ as $x \to x_0$. We therefore conclude that $f'(x_0) = 0$. The proof when f(x) has a local minimum is similar. □


It is important here to note that $f'(x_0) = 0$ is a necessary condition for a differentiable function to have a local optimum at $x_0$. It is not, however, a sufficient condition. That is, if $f'(x_0) = 0$, then it is not necessarily true that $x_0$ is a point of local optimum. For example, the function $f(x) = x^3$ has a zero derivative at the origin, but f(x) does not have a local optimum there (why not?). In general, a value $x_0$ for which $f'(x_0) = 0$ is called a stationary value for the function. Thus a stationary value does not necessarily correspond to a local optimum.

We should also note that Theorem 4.4.1 assumes that f(x) is differentiable in a neighborhood of $x_0$. If this condition is not fulfilled, the theorem ceases to be true. The existence of $f'(x_0)$ is not a prerequisite for f(x) to have a local optimum at $x_0$. In fact, f(x) can have a local optimum at $x_0$ even if $f'(x_0)$ does not exist. For example, f(x) = |x| has a local minimum at x = 0, but f'(0) does not exist.

We recall from Corollary 3.4.1 that if f(x) is continuous on [a, b], then it must achieve its absolute optima at some points inside [a, b]. These points can be interior points, that is, points that belong to the open interval (a, b), or they can be end (boundary) points. In particular, if f'(x) exists on (a, b), to determine the locations of the absolute optima we must solve the equation f'(x) = 0 and then compare the values of f(x) at the roots of this equation with f(a) and f(b). The largest (smallest) of these values is the absolute maximum (minimum). In the event $f'(x) \ne 0$ on (a, b), then f(x) must achieve its absolute optimum at an end point.


4.4.1. A Sufficient Condition for a Local Optimum 


We shall make use of Taylor’s expansion to come up with a sufficient condition for f(x) to have a local optimum at $x = x_0$.

Suppose that f(x) has n derivatives in a neighborhood $N_\delta(x_0)$ such that $f'(x_0) = f''(x_0) = \cdots = f^{(n-1)}(x_0) = 0$, but $f^{(n)}(x_0) \ne 0$. Then by Taylor’s theorem we have

\[ f(x) = f(x_0) + \frac{h^n}{n!} f^{(n)}(x_0 + \theta_n h) \]

for any x in $N_\delta(x_0)$, where $h = x - x_0$ and $0 < \theta_n < 1$. Furthermore, if we assume that $f^{(n)}(x)$ is continuous at $x_0$, then

\[ f^{(n)}(x_0 + \theta_n h) = f^{(n)}(x_0) + o(1), \]

where, as we recall from Section 3.3, $o(1) \to 0$ as $h \to 0$. We can therefore write

\[ f(x) - f(x_0) = \frac{h^n}{n!} f^{(n)}(x_0) + o(h^n). \tag{4.19} \]

In order for f(x) to have a local optimum at $x_0$, $f(x) - f(x_0)$ must have the same sign (positive or negative) for small values of h inside a neighborhood of 0. But, from (4.19), the sign of $f(x) - f(x_0)$ is determined by the sign of $h^n f^{(n)}(x_0)$. We can then conclude that if n is even, then a local optimum is achieved at $x_0$. In this case, a local maximum occurs at $x_0$ if $f^{(n)}(x_0) < 0$, whereas $f^{(n)}(x_0) > 0$ indicates that $x_0$ is a point of local minimum. If, however, n is odd, then $x_0$ is not a point of local optimum, since $f(x) - f(x_0)$ changes sign around $x_0$. In this case, the point on the graph of y = f(x) whose abscissa is $x_0$ is called a saddle point.

In particular, if $f'(x_0) = 0$ and $f''(x_0) \ne 0$, then $x_0$ is a point of local optimum. When $f''(x_0) < 0$, f(x) has a local maximum at $x_0$, and when $f''(x_0) > 0$, f(x) has a local minimum at $x_0$.

EXAMPLE 4.4.1. Let $f(x) = 2x^3 - 3x^2 - 12x + 6$. Then $f'(x) = 6x^2 - 6x - 12 = 0$ at x = −1, 2, and

\[
f''(x) = 12x - 6 =
\begin{cases} -18 & \text{at } x = -1, \\ 18 & \text{at } x = 2. \end{cases}
\]

We have then a local maximum at x = −1 and a local minimum at x = 2.


EXAMPLE 4.4.2. f(x) =x*— 1. In this case, 
f'(x) =4x° =0 at x=0, 
f" (x) =12x?=0 at x=0, 
f"(x) =24x=0 at x=0, 
fO(x) = 24. 
Then, x = 0 is a point of local minimum. 
EXAMPLE 4.4.3. Consider f(x) = (x + 5)?(x? — 10). We have 
f'(x) =5(x + 5)(x- 1)(x +2)’, 
f" (x) = 10(x + 2)(2x* + 8x-1), 
f”(x) = 10(6x* + 24x + 15). 


Here, f'(x) =0 at x = —5, —2, and 1. At x = —5 there is a local maximum, 
since f”(—5) = —270 <0. At x =1 we have a local minimum, since f” (1) = 
270 > 0. However, at x = —2 a saddle point occurs, since f”(—2)=0 and 
f'(—2) = —90 +0. 


EXAMPLE 4.4.4. f(x) =(2x+1)/(x + 4), 0<x <5. Then 


f(x) =7/(x+ 4). 


In this case, f'(x) does not vanish anywhere in (0,5). Thus f(x) has no local 
maxima or local minima in that open interval. Being continuous on [0,5], 
f(x) must achieve its absolute optima at the end points. Since f’(x) > 0, f(x) 
is strictly monotone increasing on [0,5] by Corollary 4.2.1. Its absolute 
minimum and absolute maximum are therefore attained at x =0 and x=5, 
respectively. 
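
The higher-order derivative test used in Examples 4.4.1 through 4.4.3 can be checked mechanically for polynomials. The following is a minimal Python sketch and is not part of the original text. The helper names `poly_deriv`, `poly_eval`, and `classify` are hypothetical, polynomials are assumed to be stored as integer coefficient lists in ascending powers, and the test point is assumed to be an integer so that the comparison with zero is exact.

```python
def poly_deriv(coeffs):
    """Differentiate a polynomial stored as [c0, c1, c2, ...] (ascending powers)."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def poly_eval(coeffs, x):
    """Evaluate the polynomial at x by Horner's rule."""
    result = 0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def classify(coeffs, x0, max_order=10):
    """Classify x0 from the first derivative of order n that is nonzero at x0."""
    deriv = poly_deriv(coeffs)
    for n in range(1, max_order + 1):
        value = poly_eval(deriv, x0)
        if value != 0:
            if n == 1:
                return "not a critical point"
            if n % 2 == 1:
                return "saddle point"          # n odd: f(x) - f(x0) changes sign
            return "local minimum" if value > 0 else "local maximum"
        deriv = poly_deriv(deriv)
    return "undetermined"


# Example 4.4.3: f(x) = (x + 5)^2 (x^3 - 10), expanded by hand.
f_443 = [-250, -100, -10, 25, 10, 1]
# Example 4.4.2: f(x) = x^4 - 1.
f_442 = [-1, 0, 0, 0, 1]

for x0 in (-5, -2, 1):
    print(x0, classify(f_443, x0))
print(0, classify(f_442, 0))
```

Exact integer arithmetic is used here on purpose: with floating-point inputs the test `value != 0` would need a tolerance.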


4.5. APPLICATIONS IN STATISTICS 


Differential calculus has many applications in statistics. Let us consider some
of these applications.

4.5.1. Functions of Random Variables


Let $Y$ be a continuous random variable whose cumulative distribution
function is $F(y) = P(Y \le y)$. If $F(y)$ is differentiable for all $y$, then its
derivative $F'(y)$ is called the density function of $Y$ and is denoted by $f(y)$.
Continuous random variables for which $f(y)$ exists are said to be absolutely
continuous.

Let $Y$ be an absolutely continuous random variable, and let $W$ be another
random variable which can be expressed as a function of $Y$ of the form
$W = \psi(Y)$. Suppose that this function is strictly monotone and differentiable
over its domain. By Theorem 3.5.1, $\psi$ has a unique inverse $\psi^{-1}$, which is also
differentiable by Theorem 4.2.4. Let $G(w)$ denote the cumulative distribution
function of $W$.

If $\psi$ is strictly monotone increasing, then

$$G(w) = P(W \le w) = P[Y \le \psi^{-1}(w)] = F[\psi^{-1}(w)].$$

If it is strictly monotone decreasing, then

$$G(w) = P(W \le w) = P[Y \ge \psi^{-1}(w)] = 1 - F[\psi^{-1}(w)].$$


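By differentiating $G(w)$ using the chain rule we obtain the density function
$g(w)$ for $W$, namely,

$$g(w) = \frac{dF[\psi^{-1}(w)]}{d\psi^{-1}(w)} \, \frac{d\psi^{-1}(w)}{dw} = f[\psi^{-1}(w)] \, \frac{d\psi^{-1}(w)}{dw} \tag{4.20}$$

if $\psi$ is strictly monotone increasing, and

$$g(w) = -\frac{dF[\psi^{-1}(w)]}{d\psi^{-1}(w)} \, \frac{d\psi^{-1}(w)}{dw} = -f[\psi^{-1}(w)] \, \frac{d\psi^{-1}(w)}{dw} \tag{4.21}$$

if $\psi$ is strictly monotone decreasing. By combining (4.20) and (4.21) we
obtain

$$g(w) = f[\psi^{-1}(w)] \left| \frac{d\psi^{-1}(w)}{dw} \right|. \tag{4.22}$$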


For example, suppose that $Y$ has the uniform distribution $U(0, 1)$ whose
density function is

$$f(y) = \begin{cases} 1, & 0 < y < 1, \\ 0, & \text{elsewhere}. \end{cases}$$

Let $W = -\log Y$. Using formula (4.22), the density function of $W$ is given by

$$g(w) = \begin{cases} e^{-w}, & 0 < w < \infty, \\ 0, & \text{elsewhere}. \end{cases}$$
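
The uniform-to-exponential example can be checked by simulation. The sketch below is an illustration added here, not part of the original text. It assumes only Python's standard `random` and `math` modules; the seed, the sample size, and the helper name `empirical_cdf` are arbitrary choices. It compares the empirical distribution of $W = -\log Y$ with $1 - e^{-w}$, the integral of the density obtained from (4.22).

```python
import math
import random

random.seed(12345)  # arbitrary seed so that the run is reproducible

n = 100_000
# 1 - random.random() lies in (0, 1], which avoids taking log(0).
w = [-math.log(1.0 - random.random()) for _ in range(n)]


def empirical_cdf(sample, t):
    """Fraction of the sample that is <= t."""
    return sum(1 for v in sample if v <= t) / len(sample)


for t in (0.5, 1.0, 2.0):
    exact = 1.0 - math.exp(-t)  # integral of g(w) = exp(-w) from 0 to t
    print(t, round(empirical_cdf(w, t), 4), round(exact, 4))
```

The two printed columns should agree to about two decimal places for a sample of this size.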


The Mean and Variance of $W = \psi(Y)$


The mean and variance of the random variable $W$ can be obtained by using
its density function:

$$E(W) = \int_{-\infty}^{\infty} w \, g(w) \, dw,$$

$$\mathrm{Var}(W) = E[W - E(W)]^2 = \int_{-\infty}^{\infty} [w - E(W)]^2 g(w) \, dw.$$

In some cases, however, the exact distribution of $Y$ may not be known, or
$g(w)$ may be a complicated function to integrate. In such cases, approximate
expressions for the mean and variance of $W$ can be obtained by applying
Taylor's expansion around the mean of $Y$, $\mu$. If we assume that $\psi''(y)$ exists,
then

$$\psi(y) = \psi(\mu) + (y - \mu)\psi'(\mu) + o(y - \mu).$$

If $o(y - \mu)$ is small enough, first-order approximations of $E(W)$ and $\mathrm{Var}(W)$
can be obtained, namely,

$$E(W) \approx \psi(\mu), \ \text{since } E(Y - \mu) = 0; \qquad \mathrm{Var}(W) \approx \sigma^2[\psi'(\mu)]^2, \tag{4.23}$$

where $\sigma^2 = \mathrm{Var}(Y)$, and the symbol $\approx$ denotes approximate equality. If
$o(y - \mu)$ is not small enough, then higher-order approximations can be
utilized provided that certain derivatives of $\psi(y)$ exist. For example, if $\psi'''(y)$
exists, then

$$\psi(y) = \psi(\mu) + (y - \mu)\psi'(\mu) + \tfrac{1}{2}(y - \mu)^2 \psi''(\mu) + o[(y - \mu)^2].$$

In this case, if $o[(y - \mu)^2]$ is small enough, then second-order approximations
can be obtained for $E(W)$ and $\mathrm{Var}(W)$ of the form

$$E(W) \approx \psi(\mu) + \tfrac{1}{2}\sigma^2 \psi''(\mu), \quad \text{since } E(Y - \mu)^2 = \sigma^2,$$

$$\mathrm{Var}(W) \approx E\{Q(Y) - E[Q(Y)]\}^2,$$

where

$$Q(Y) = \psi(\mu) + (Y - \mu)\psi'(\mu) + \tfrac{1}{2}(Y - \mu)^2 \psi''(\mu).$$

Thus,

$$\mathrm{Var}(W) \approx \sigma^2[\psi'(\mu)]^2 + \tfrac{1}{4}[\psi''(\mu)]^2 \, \mathrm{Var}[(Y - \mu)^2] + \psi'(\mu)\psi''(\mu) E[(Y - \mu)^3].$$
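
To see how the first-order and second-order approximations compare, the following Python sketch evaluates both for one case in which the exact answer is known. This is an illustration added here under stated assumptions, not an example from the original text: $Y$ is taken to be normal with mean 1 and variance 0.04, and $\psi(y) = e^y$, so that $W$ is lognormal. The function names `first_order` and `second_order` are hypothetical.

```python
import math


def first_order(psi, dpsi, mu, var):
    """Formula (4.23): approximate mean and variance of W = psi(Y)."""
    return psi(mu), var * dpsi(mu) ** 2


def second_order(psi, dpsi, d2psi, mu, var, mu3, var_sq_dev):
    """Second-order approximations of E(W) and Var(W).

    mu3        = E[(Y - mu)^3]
    var_sq_dev = Var[(Y - mu)^2]
    """
    mean = psi(mu) + 0.5 * var * d2psi(mu)
    variance = (var * dpsi(mu) ** 2
                + 0.25 * d2psi(mu) ** 2 * var_sq_dev
                + dpsi(mu) * d2psi(mu) * mu3)
    return mean, variance


# Assumed setting: Y ~ N(mu, var) and psi(y) = exp(y).
mu, var = 1.0, 0.04
m1, v1 = first_order(math.exp, math.exp, mu, var)
# For a normal Y: E[(Y - mu)^3] = 0 and Var[(Y - mu)^2] = 2 * var^2.
m2, v2 = second_order(math.exp, math.exp, math.exp, mu, var, 0.0, 2.0 * var ** 2)

# Exact lognormal mean and variance, for comparison.
exact_mean = math.exp(mu + var / 2.0)
exact_var = (math.exp(var) - 1.0) * math.exp(2.0 * mu + var)

print("mean:", m1, m2, exact_mean)
print("variance:", v1, v2, exact_var)
```

In this setting the second-order values are closer to the exact ones than the first-order values, which is the behavior the expansion suggests when $\sigma^2$ is small.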

Variance Stabilizing Transformations


One of the basic assumptions of regression and analysis of variance is the
constancy of the variance $\sigma^2$ of a response variable $Y$ on which experimental
data are obtained. This assumption is often referred to as the assumption of
homoscedasticity. There are situations, however, in which $\sigma^2$ is not constant
for all the data. When this happens, $Y$ is said to be heteroscedastic.
Heteroscedasticity can cause problems and difficulties in connection with the
statistical analysis of the data (for a survey of the problems of heteroscedasticity,
see Judge et al., 1980).

Some situations that lead to heteroscedasticity are (see Wetherill et al., 
1986, page 200): 


i. The Use of Averaged Data. In many experimental situations, the data 
used in a regression program consist of averages of samples that are 
different in size. This happens sometimes in survey analysis. 

ii. Variances Depending on the Explanatory Variables. The variance of an 
observation can sometimes depend on the explanatory (or input) 
variables in the hypothesized model, as is the case with some econo- 
metric models. For example, if the response variable is household 
expenditure and one explanatory variable is household income, then 
the variance of the observations may be a function of household 
income. 

iii. Variances Depending on the Mean Response. The response variable $Y$
may have a distribution whose variance is a function of its mean, that
is, $\sigma^2 = h(\mu)$, where $\mu$ is the mean of $Y$. The Poisson distribution, for
example, has the property that $\sigma^2 = \mu$. Thus as $\mu$ changes (as a
function of some explanatory variables), so will $\sigma^2$. The following
example illustrates this situation (see Chatterjee and Price, 1977, page
39): Let $Y$ be the number of accidents, and $x$ be the speed of
operating a lathe in a machine shop. Suppose that a linear relationship
is assumed between $Y$ and $x$ of the form

$$Y = \beta_0 + \beta_1 x + \epsilon,$$

where $\epsilon$ is a random error with a zero mean. Here, $Y$ has the Poisson
distribution with mean $\mu = \beta_0 + \beta_1 x$. The variance of $Y$, being equal
to $\mu$, will not be constant, since it depends on $x$.

Heteroscedasticity due to dependence on the mean response can be
removed, or at least reduced, by a suitable transformation of the response
variable $Y$. So let us suppose that $\sigma^2 = h(\mu)$. Let $W = \psi(Y)$. We need to
find a proper transformation $\psi$ that causes $W$ to have almost the constant
variance property. If this can be accomplished, then $\psi$ is referred to as a
variance stabilizing transformation.

If the first-order approximation of $\mathrm{Var}(W)$ by Taylor's expansion is adequate,
then by formula (4.23) we can select $\psi$ so that

$$h(\mu)[\psi'(\mu)]^2 = c, \tag{4.24}$$

where $c$ is a constant. Without loss of generality, let $c = 1$. A solution of
(4.24) is given by

$$\psi(\mu) = \int \frac{d\mu}{\sqrt{h(\mu)}}.$$

Thus if $W = \psi(Y)$, then $W$ will have a variance approximately equal to
one. For example, if $h(\mu) = \mu$, as is the case with the Poisson distribution,
then

$$\psi(\mu) = \int \frac{d\mu}{\sqrt{\mu}} = 2\sqrt{\mu}.$$

Hence, $W = 2\sqrt{Y}$ will have a variance approximately equal to one. (In this
case, it is more common to use the transformation $W = \sqrt{Y}$, which has a
variance approximately equal to 0.25.) Thus in the earlier example regarding
the relationship between the number of accidents and the speed of operating
a lathe, we need to regress $\sqrt{Y}$ against $x$ in order to ensure approximate
homoscedasticity.
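
The stabilizing effect of the square root on Poisson data can be seen in a small simulation. The Python sketch below is an added illustration, not part of the original text. The Poisson sampler `poisson_draw` is a hypothetical helper that uses Knuth's multiplication method (suitable only for moderate means), and the means, sample size, and seed are arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(2024)  # arbitrary seed for reproducibility


def poisson_draw(mean):
    """Draw one Poisson variate by Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, random.random()
    while product > limit:
        count += 1
        product *= random.random()
    return count


results = {}
for mean in (10, 25, 50):
    sample = [poisson_draw(mean) for _ in range(20_000)]
    raw_var = statistics.variance(sample)                            # grows with the mean
    root_var = statistics.variance([math.sqrt(y) for y in sample])   # stays near 0.25
    results[mean] = (raw_var, root_var)
    print(mean, round(raw_var, 2), round(root_var, 3))
```

The variance of the raw counts tracks the mean, while the variance of $\sqrt{Y}$ stays close to 0.25 across the three means, in line with the first-order argument above.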

The relationship (if any) between $\sigma^2$ and $\mu$ may be determined by
theoretical considerations based on a knowledge of the type of data used (for
example, Poisson data). In practice, however, such knowledge may not be
available a priori. In this case, the appropriate transformation is selected
empirically on the basis of residual analysis of the data. See, for example, Box
and Draper (1987, Chapter 8), Montgomery and Peck (1982, Chapter 3). If
possible, a transformation is selected to correct nonnormality (if the original
data are not believed to be normally distributed) as well as heteroscedasticity.
In this respect, a useful family of transformations introduced by Box and Cox
(1964) can be used. These authors considered the power family of transformations
defined by

$$\psi(Y) = \begin{cases} (Y^\lambda - 1)/\lambda, & \lambda \neq 0, \\ \log Y, & \lambda = 0. \end{cases}$$

This family may only be applied when $Y$ has positive values. Furthermore,
since by l'Hospital's rule

$$\lim_{\lambda \to 0} \frac{Y^\lambda - 1}{\lambda} = \lim_{\lambda \to 0} Y^\lambda \log Y = \log Y,$$

the Box-Cox transformation is a continuous function of $\lambda$. An estimate of $\lambda$
can be obtained from the data using the method of maximum likelihood (see
Montgomery and Peck, 1982, Section 3.7.1; Box and Draper, 1987, Section
8.4).
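
The continuity of the Box-Cox family at $\lambda = 0$ is easy to observe numerically. The short Python sketch below is an added illustration; the function name `box_cox` is hypothetical, and `math.expm1` is used only to keep the computation accurate for very small $\lambda$.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transformation of a positive value y."""
    if y <= 0:
        raise ValueError("the Box-Cox family requires y > 0")
    if lam == 0:
        return math.log(y)
    # expm1(z) computes exp(z) - 1 accurately when z = lam * log(y) is tiny.
    return math.expm1(lam * math.log(y)) / lam


for lam in (1.0, 0.5, 1e-3, 1e-6, 0.0):
    print(lam, box_cox(2.0, lam))
```

As $\lambda$ decreases toward zero, the printed values approach $\log 2$, the value assigned at $\lambda = 0$.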


Asymptotic Distributions 


The asymptotic distributions of functions of random variables are of special
interest in statistical limit theory. By definition, a sequence of random
variables $\{Y_n\}_{n=1}^\infty$ converges in distribution to the random variable $Y$ if

$$\lim_{n \to \infty} F_n(y) = F(y)$$

at each point $y$ where $F(y)$ is continuous, where $F_n(y)$ is the cumulative
distribution function of $Y_n$ ($n = 1, 2, \ldots$) and $F(y)$ is the cumulative
distribution function of $Y$ (see Section 5.3 concerning sequences of functions). This
form of convergence is denoted by writing

$$Y_n \xrightarrow{d} Y.$$

An illustration of convergence in distribution is provided by the well-known
central limit theorem. It states that if $\{Y_n\}_{n=1}^\infty$ is a sequence of independent and
identically distributed random variables with common mean and variance, $\mu$
and $\sigma^2$, respectively, that are both finite, and if $\bar{Y}_n = \sum_{i=1}^n Y_i/n$ is the sample
mean of a sample of size $n$, then as $n \to \infty$,

$$\frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} Z,$$

where $Z$ has the standard normal distribution $N(0, 1)$.
An extension of the central limit theorem that includes functions of
random variables is given by the following theorem:

Theorem 4.5.1. Let $\{Y_n\}_{n=1}^\infty$ be a sequence of independent and identically
distributed random variables with mean $\mu$ and variance $\sigma^2$ (both finite), and
let $\bar{Y}_n$ be the sample mean of a sample of size $n$. If $\psi(y)$ is a function whose
derivative $\psi'(y)$ exists and is continuous in a neighborhood of $\mu$ such that
$\psi'(\mu) \neq 0$, then as $n \to \infty$,

$$\frac{\psi(\bar{Y}_n) - \psi(\mu)}{\sigma|\psi'(\mu)|/\sqrt{n}} \xrightarrow{d} Z.$$

Proof. See Wilks (1962, page 259). □

On the basis of Theorem 4.5.1 we can assert that when $n$ is large enough,
$\psi(\bar{Y}_n)$ is approximately distributed as a normal variate with a mean $\psi(\mu)$ and
a standard deviation $(\sigma/\sqrt{n})|\psi'(\mu)|$. For example, if $\psi(y) = y^2$, then as
$n \to \infty$,

$$\frac{\bar{Y}_n^2 - \mu^2}{2|\mu|\sigma/\sqrt{n}} \xrightarrow{d} Z.$$
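
The limiting statement for $\psi(y) = y^2$ can be explored by simulation. The Python sketch below is an added illustration, not part of the original text. It assumes that the $Y_i$ are exponential with mean 1 (so $\mu = \sigma = 1$), and the sample size, the number of replications, the seed, and the helper name `standardized_square` are arbitrary choices.

```python
import math
import random
import statistics

random.seed(7)  # arbitrary seed for reproducibility

mu, sigma = 1.0, 1.0          # mean and standard deviation of an exponential(1) variate
n, replications = 100, 4_000


def standardized_square():
    """One draw of (Ybar^2 - mu^2) / (2 |mu| sigma / sqrt(n))."""
    ybar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return (ybar ** 2 - mu ** 2) / (2.0 * abs(mu) * sigma / math.sqrt(n))


draws = [standardized_square() for _ in range(replications)]
print(round(statistics.mean(draws), 3), round(statistics.variance(draws), 3))
```

For a sample size of this order the printed mean should be near 0 and the printed variance near 1. The small remaining discrepancies are finite-sample effects that the limit theorem does not describe.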


4.5.2. Approximating Response Functions 


Perhaps the most prevalent use of Taylor's expansion in statistics is in the
area of linear models. Let $Y$ denote a response variable, such as the yield of
a product, whose mean $\mu(x)$ is believed to depend on an explanatory (or
input) variable $x$ such as temperature or pressure. The true relationship
between $\mu$ and $x$ is usually unknown. However, if $\mu(x)$ is considered to have
derivatives of all orders, then it is possible to approximate its values by using
low-order terms of a Taylor's series over a limited range of interest. In this
case, $\mu(x)$ can be represented approximately by a polynomial of degree $d$
($\ge 1$) of the form

$$\mu(x) = \beta_0 + \sum_{j=1}^{d} \beta_j x^j,$$

where $\beta_0, \beta_1, \ldots, \beta_d$ are unknown parameters. Estimates of these parameters
are obtained by running $n$ ($\ge d + 1$) experiments in which $n$ observations,
$y_1, y_2, \ldots, y_n$, on $Y$ are obtained for specified values of $x$. This leads us
to the linear model

$$y_i = \beta_0 + \sum_{j=1}^{d} \beta_j x_i^j + \epsilon_i, \quad i = 1, 2, \ldots, n, \tag{4.25}$$

where $\epsilon_i$ is a random error. The method of least squares can then be used to
estimate the unknown parameters in (4.25). The adequacy of model (4.25) to
represent the true mean response $\mu(x)$ can be checked using the given data
provided that replicated observations are available at some points inside the
region of interest. For more details concerning the adequacy of fit of linear
models and the method of least squares, see, for example, Box and Draper
(1987, Chapters 2 and 3) and Khuri and Cornell (1996, Chapter 2).
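
As a rough illustration of fitting model (4.25), the Python sketch below estimates the coefficients of a polynomial by least squares. It is an added example under stated assumptions, not the procedure of the cited references: the function name `fit_polynomial` is hypothetical, the normal equations are solved by plain Gaussian elimination only to keep the block self-contained, and the data are made up and free of noise so that the true coefficients are known.

```python
def fit_polynomial(xs, ys, degree):
    """Least squares estimates of beta_0, ..., beta_d in model (4.25)."""
    p = degree + 1
    # Normal equations (X'X) beta = X'y for the design columns 1, x, ..., x^d.
    A = [[sum(x ** (r + c) for x in xs) for c in range(p)] for r in range(p)]
    b = [sum(y * x ** r for x, y in zip(xs, ys)) for r in range(p)]
    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, p):
            factor = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    beta = [0.0] * p
    for r in reversed(range(p)):
        tail = sum(A[r][c] * beta[c] for c in range(r + 1, p))
        beta[r] = (b[r] - tail) / A[r][r]
    return beta


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]   # made-up, noise-free responses
beta = fit_polynomial(xs, ys, degree=2)
print([round(value, 6) for value in beta])
```

For real data, and especially for higher degrees, a numerically more stable method such as a QR decomposition would usually be preferred to the normal equations.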
4.5.3. The Poisson Process


A random phenomenon that arises through a process which continues in time
(or space) in a manner controlled by the laws of probability is called a
stochastic process. A particular example of such a process is the Poisson
process, which is associated with the number of events that take place over a
period of time, for example, the arrival of customers at a service counter, or
the arrival of $\alpha$-rays, emitted from a radioactive source, at a Geiger counter.

Define $p_n(t)$ as the probability of $n$ arrivals during a time interval of
length $t$. For a Poisson process, the following postulates are assumed to hold:

1. The probability of exactly one arrival during a small time interval of
length $h$ is approximately proportional to $h$, that is,

$$p_1(h) = \lambda h + o(h)$$

as $h \to 0$, where $\lambda$ is a constant.

2. The probability of more than one arrival during a small time interval of
length $h$ is negligible, that is,

$$\sum_{n > 1} p_n(h) = o(h)$$

as $h \to 0$.

3. The probability of an arrival occurring during a small time interval
$(t, t + h)$ does not depend on what happened prior to $t$. This means that
the events defined according to the number of arrivals occurring during
nonoverlapping time intervals are independent.


On the basis of the above postulates, an expression for $p_n(t)$ can be found
as follows: For $n \ge 1$ and for small $h$ we have approximately

$$p_n(t + h) = p_n(t) p_0(h) + p_{n-1}(t) p_1(h) = p_n(t)[1 - \lambda h + o(h)] + p_{n-1}(t)[\lambda h + o(h)], \tag{4.26}$$

since the probability of no arrivals during the time interval $(t, t + h)$ is
approximately equal to $1 - p_1(h)$. For $n = 0$ we have

$$p_0(t + h) = p_0(t) p_0(h) = p_0(t)[1 - \lambda h + o(h)]. \tag{4.27}$$


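From (4.26) and (4.27) we then get for $n \ge 1$,

$$\frac{p_n(t + h) - p_n(t)}{h} = p_n(t)\left[-\lambda + \frac{o(h)}{h}\right] + p_{n-1}(t)\left[\lambda + \frac{o(h)}{h}\right],$$

and for $n = 0$,

$$\frac{p_0(t + h) - p_0(t)}{h} = p_0(t)\left[-\lambda + \frac{o(h)}{h}\right].$$

By taking the limit as $h \to 0$ we obtain the derivatives

$$p_n'(t) = -\lambda p_n(t) + \lambda p_{n-1}(t), \quad n \ge 1, \tag{4.28}$$

$$p_0'(t) = -\lambda p_0(t). \tag{4.29}$$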
From (4.29) the solution for $p_0(t)$ is given by

$$p_0(t) = e^{-\lambda t}, \tag{4.30}$$

since $p_0(t) = 1$ when $t = 0$ (that is, initially there were no arrivals). By
substituting (4.30) in (4.28) when $n = 1$ we get

$$p_1'(t) = -\lambda p_1(t) + \lambda e^{-\lambda t}. \tag{4.31}$$

If we now multiply the two sides of (4.31) by $e^{\lambda t}$ we obtain

$$e^{\lambda t} p_1'(t) + \lambda p_1(t) e^{\lambda t} = \lambda,$$

or

$$[e^{\lambda t} p_1(t)]' = \lambda.$$

Hence,

$$e^{\lambda t} p_1(t) = \lambda t + c,$$

where $c$ is a constant. This constant must be equal to zero, since $p_1(0) = 0$.
We then have

$$p_1(t) = \lambda t e^{-\lambda t}.$$

By continuing in this process and using equation (4.28) we can find $p_2(t)$,
then $p_3(t)$, etc. In general, it can be shown that

$$p_n(t) = \frac{e^{-\lambda t}(\lambda t)^n}{n!}, \quad n = 0, 1, 2, \ldots. \tag{4.32}$$

In particular, if $t = 1$, then formula (4.32) gives the probability of $n$ arrivals
during one unit of time, namely,

$$p_n(1) = \frac{e^{-\lambda} \lambda^n}{n!}, \quad n = 0, 1, \ldots.$$

This gives the probability mass function of a Poisson random variable with
mean $\lambda$.
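
The step from the differential equations (4.28) and (4.29) to the closed form (4.32) can be checked numerically. The Python sketch below is an added illustration, not part of the original derivation. It integrates the system for $n = 0, 1, \ldots, n_{\max}$ with a classical fourth-order Runge-Kutta scheme and compares the result with (4.32). The function name `poisson_probs_ode`, the step count, and the values of $\lambda$ and $t$ are arbitrary assumptions.

```python
import math


def poisson_probs_ode(lam, t, n_max, steps=1000):
    """Integrate (4.28)-(4.29) from 0 to t, starting from p_0(0) = 1 and p_n(0) = 0."""
    h = t / steps
    p = [1.0] + [0.0] * n_max

    def rhs(state):
        # Right-hand sides of (4.29) for n = 0 and (4.28) for n >= 1.
        return [-lam * state[0]] + [
            -lam * state[n] + lam * state[n - 1] for n in range(1, n_max + 1)
        ]

    for _ in range(steps):
        k1 = rhs(p)
        k2 = rhs([x + 0.5 * h * k for x, k in zip(p, k1)])
        k3 = rhs([x + 0.5 * h * k for x, k in zip(p, k2)])
        k4 = rhs([x + h * k for x, k in zip(p, k3)])
        p = [x + h * (a + 2 * b + 2 * c + d) / 6.0
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


lam, t, n_max = 2.0, 1.5, 6
numeric = poisson_probs_ode(lam, t, n_max)
closed_form = [math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)
               for n in range(n_max + 1)]
for n in range(n_max + 1):
    print(n, numeric[n], closed_form[n])
```

Because the equation for $p_n$ involves only $p_n$ and $p_{n-1}$, cutting the system off at $n_{\max}$ does not affect the values computed for the retained $n$.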


4.5.4. Minimizing the Sum of Absolute Deviations 


Consider a data set consisting of $n$ observations $y_1, y_2, \ldots, y_n$. For an
arbitrary real number $a$, let $D(a)$ denote the sum of absolute deviations of
the data from $a$, that is,

$$D(a) = \sum_{i=1}^{n} |y_i - a|.$$

For a given $a$, $D(a)$ represents a measure of spread, or variation, for the data
set. Since the value of $D(a)$ varies with $a$, it may be of interest to determine
its minimum. We now show that $D(a)$ is minimized when $a = \mu^*$, where $\mu^*$
denotes the median of the data set. By definition, $\mu^*$ is a value that falls in
the middle when the observations are arranged in order of magnitude. It is a
measure of location like the mean. If we write $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$ for the
ordered $y_i$'s, then when $n$ is odd, $\mu^*$ is the unique value $y_{((n+1)/2)}$; whereas
when $n$ is even, $\mu^*$ is any value such that $y_{(n/2)} \le \mu^* \le y_{(n/2+1)}$. In the latter
case, $\mu^*$ is sometimes chosen as the middle of the interval.

There are several ways to show that $\mu^*$ minimizes $D(a)$. The following
simple proof is due to Blyth (1990):

On the interval $y_{(k)} < a < y_{(k+1)}$, $k = 1, 2, \ldots, n - 1$, we have

$$D(a) = \sum_{i=1}^{n} |y_{(i)} - a|$$

$$= \sum_{i=1}^{k} (a - y_{(i)}) + \sum_{i=k+1}^{n} (y_{(i)} - a)$$

$$= ka - \sum_{i=1}^{k} y_{(i)} + \sum_{i=k+1}^{n} y_{(i)} - (n - k)a.$$


The function $D(a)$ is continuous for all $a$ and is differentiable everywhere
except at $y_1, y_2, \ldots, y_n$. For $a \neq y_i$ ($i = 1, 2, \ldots, n$), the derivative $D'(a)$ is
given by

$$D'(a) = 2(k - n/2).$$

If $k \neq n/2$, then $D'(a) \neq 0$ on $(y_{(k)}, y_{(k+1)})$, and by Corollary 4.2.1, $D(a)$
must be strictly monotone on $[y_{(k)}, y_{(k+1)}]$, $k = 1, 2, \ldots, n - 1$.

Now, when $n$ is odd, $D(a)$ is strictly monotone decreasing for $a \le
y_{((n+1)/2)}$, because $D'(a) < 0$ over $(y_{(k)}, y_{(k+1)})$ for $k < n/2$. It is strictly
monotone increasing for $a \ge y_{((n+1)/2)}$, because $D'(a) > 0$ over $(y_{(k)}, y_{(k+1)})$
for $k > n/2$. Hence, $\mu^* = y_{((n+1)/2)}$ is a point of absolute minimum for $D(a)$.
Furthermore, when $n$ is even, $D(a)$ is strictly monotone decreasing for
$a \le y_{(n/2)}$, because $D'(a) < 0$ over $(y_{(k)}, y_{(k+1)})$ for $k < n/2$. Also, $D(a)$ is
constant over $(y_{(n/2)}, y_{(n/2+1)})$, because $D'(a) = 0$ for $k = n/2$, and is strictly
monotone increasing for $a \ge y_{(n/2+1)}$, because $D'(a) > 0$ over $(y_{(k)}, y_{(k+1)})$ for
$k > n/2$. This indicates that $D(a)$ achieves its absolute minimum at any point
$\mu^*$ such that $y_{(n/2)} \le \mu^* \le y_{(n/2+1)}$, which completes the proof.
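
The two cases of the proof (odd and even $n$) can be illustrated on small data sets. The Python sketch below is an added example; the data values, the grid of candidate points, and the function name `sum_abs_dev` are arbitrary assumptions.

```python
import statistics


def sum_abs_dev(data, a):
    """D(a): sum of absolute deviations of the data from a."""
    return sum(abs(y - a) for y in data)


odd_data = [3, 1, 7, 4, 9]     # n = 5, the median is the middle order statistic
even_data = [1, 2, 6, 10]      # n = 4, every point between 2 and 6 is a median

grid = [k / 10 for k in range(0, 101)]   # candidate values a = 0.0, 0.1, ..., 10.0
best_odd = min(grid, key=lambda a: sum_abs_dev(odd_data, a))
print(best_odd, statistics.median(odd_data))

# For even n, D(a) is flat between the two middle order statistics.
print([sum_abs_dev(even_data, a) for a in (2, 3, 4.5, 6)])
print(sum_abs_dev(even_data, 1.9), sum_abs_dev(even_data, 6.1))
```

The grid search is used only for illustration; the proof above shows that no search is needed, since the minimizer is always a median.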
EXERCISES


In Mathematics 


4.1. Let $f(x)$ be defined in a neighborhood of the origin. Show that if $f'(0)$
exists, then

$$\lim_{h \to 0} \frac{f(h) - f(-h)}{2h} = f'(0).$$

Give a counterexample to show that the converse is not true in general,
that is, if the above limit exists, then it is not necessary that $f'(0)$ exists.

4.2. Let $f(x)$ and $g(x)$ have derivatives up to order $n$ on $[a, b]$. Let
$h(x) = f(x)g(x)$. Show that

$$h^{(n)}(x) = \sum_{i=0}^{n} \binom{n}{i} f^{(i)}(x) g^{(n-i)}(x).$$

(This is known as Leibniz's formula.)

4.3. Suppose that $f(x)$ has a derivative at a point $x_0$, $a < x_0 < b$. Show that
there exists a neighborhood $N_\delta(x_0)$ and a positive number $A$ such that

$$|f(x) - f(x_0)| < A|x - x_0|$$

for all $x \in N_\delta(x_0)$, $x \neq x_0$.

4.4. Suppose that $f(x)$ is differentiable on $(0, \infty)$ and $f'(x) \to 0$ as $x \to \infty$.
Let $g(x) = f(x + 1) - f(x)$. Prove that $g(x) \to 0$ as $x \to \infty$.

4.5. Let the function $f(x)$ be defined as

x? —2x, x>1,
xX —
A) ey x<1.

For what values of $a$ and $b$ does $f(x)$ have a continuous derivative?

4.6. Suppose that $f(x)$ is twice differentiable on $(0, \infty)$. Let $m_0$, $m_1$, $m_2$ be
the least upper bounds of $|f(x)|$, $|f'(x)|$, and $|f''(x)|$, respectively, on
$(0, \infty)$.
(a) Show that

$$|f'(x)| \le \frac{m_0}{h} + h m_2$$

for all $x$ in $(0, \infty)$ and for every $h > 0$.
(b) Deduce from (a) that

$$m_1^2 \le 4 m_0 m_2.$$

4.7. Suppose that $\lim_{x \to x_0} f'(x)$ exists. Does it follow that $f(x)$ is differentiable
at $x_0$? Give a proof to show that the statement is correct or
produce a counterexample to show that it is false.

4.8. Show that $D(a) = \sum_{i=1}^{n} |y_i - a|$ has no derivatives with respect to $a$ at
$y_1, y_2, \ldots, y_n$.

4.9. Suppose that the function $f(x)$ is such that $f'(x)$ and $f''(x)$ are
continuous in a neighborhood of the origin and satisfies $f(0) = 0$. Show
that

-a (ON te
A -i ii

4.10. Show that if $f'(x)$ exists and is bounded for all $x$, then $f(x)$ is
uniformly continuous on $R$, the set of real numbers.

4.11. Suppose that $g: R \to R$ and that $|g'(x)| < M$ for all $x \in R$, where $M$ is
a positive constant. Define $f(x) = x + cg(x)$, where $c$ is a positive
constant. Show that it is possible to choose $c$ small enough so that $f$ is
a one-to-one function.

4.12. Suppose that $f(x)$ is continuous on $[0, \infty)$, $f'(x)$ exists on $(0, \infty)$, $f(0) = 0$,
and $f'(x)$ is monotone increasing on $(0, \infty)$. Show that $g(x)$ is monotone
increasing on $(0, \infty)$, where $g(x) = f(x)/x$.

4.13. Show that if $a > 1$ and $m > 0$, then
(a) $\lim_{x \to \infty} (a^x/x^m) = \infty$,
(b) $\lim_{x \to \infty} [(\log x)/x^m] = 0$.

4.14. Apply l'Hospital's rule to find the limit

$$\lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^x.$$

4.15.
(a) Find $\lim_{x \to 0^+} (\sin x)^x$.
(b) Find $\lim_{x \to 0^+} (e^{-1/x}/x)$.

4.16. Show that

$$\lim_{x \to 0} [1 + ax + o(x)]^{1/x} = e^a,$$

where $a$ is a constant and $o(x)$ is any function whose order of
magnitude is less than that of $x$ as $x \to 0$.

4.17. Consider the functions $f(x) = 4x^3 + 6x^2 - 10x + 2$ and $g(x) = 3x^4 +
4x^3 - 5x^2 + 1$. Show that

$$\frac{f'(x)}{g'(x)} \neq \frac{f(1) - f(0)}{g(1) - g(0)}$$

for any $x \in (0, 1)$. Does this contradict Cauchy's mean value theorem?

4.18. Suppose that $f(x)$ is differentiable for $a \le x \le b$. If $f'(a) < f'(b)$ and $\gamma$
is a number such that $f'(a) < \gamma < f'(b)$, show that there exists a $\xi$,
$a < \xi < b$, for which $f'(\xi) = \gamma$ [a similar result holds if $f'(a) > f'(b)$].
[Hint: Consider the function $g(x) = f(x) - \gamma(x - a)$. Show that $g(x)$
has a minimum at $\xi$.]

4.19. Suppose that $f(x)$ is differentiable on $(a, b)$. Let $x_1, x_2, \ldots, x_n$ be in
$(a, b)$, and let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be positive numbers such that $\sum_{i=1}^{n} \lambda_i = 1$.
Show that there exists a point $c$ in $(a, b)$ such that

$$\sum_{i=1}^{n} \lambda_i f'(x_i) = f'(c).$$

[Note: This is a generalization of the result in Exercise 4.18.]

4.20. Let $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$ be in $(a, b)$ such that $x_i < y_i$ ($i =
1, 2, \ldots, n$). Show that if $f(x)$ is differentiable on $(a, b)$, then there
exists a point $c$ in $(a, b)$ such that

$$\sum_{i=1}^{n} [f(y_i) - f(x_i)] = f'(c) \sum_{i=1}^{n} (y_i - x_i).$$

4.21. Give a Maclaurin's series expansion of the function $f(x) = \log(1 + x)$.

4.22. Discuss the maxima and minima of the function $f(x) = (x^4 + 3)/(x^2 + 2)$.

4.23. Determine if f(x) =e**/x has an absolute minimum on $(0, \infty)$.

4.24. For what values of $a$ and $b$ is the function

$$f(x) = \frac{1}{x^2 + ax + b}$$

bounded on the interval $[-1, 1]$? Find the absolute maximum on that
interval.


In Statistics 


4.25. Let $Y$ be a continuous random variable whose cumulative distribution
function, $F(y)$, is strictly monotone. Let $G(y)$ be another strictly
monotone, continuous cumulative distribution function. Show that the
cumulative distribution function of the random variable $G^{-1}[F(Y)]$ is
$G(y)$.

4.26. Let $Y$ have the cumulative distribution function

$$F(y) = \begin{cases} 1 - e^{-y}, & y \ge 0, \\ 0, & y < 0. \end{cases}$$

Find the density function of $W = \sqrt{Y}$.

4.27. Let $Y$ be normally distributed with mean 1 and variance 0.04. Let
w=yY?.
(a) Find the density function of $W$.
(b) Find the exact mean and variance of $W$.
(c) Find approximate values for the mean and variance of $W$ using
Taylor's expansion, and compare the results with those of (b).

4.28. Let $Z$ be normally distributed with mean 0 and variance 1. Let $Y = Z^2$.
Find the density function of $Y$.
[Note: The function $\psi(z) = z^2$ is not strictly monotone for all $z$.]

4.29. Let $X$ be a random variable that denotes the age at failure of a
component. The failure rate is defined as the probability of failure in a
finite interval of time, say of length $h$, given the age of the component,
say $x$. This failure rate is therefore equal to

$$P(x \le X \le x + h \mid X \ge x).$$

Consider the following limit:

$$\lim_{h \to 0} \frac{1}{h} P(x \le X \le x + h \mid X \ge x).$$

If this limit exists, then it is called the hazard rate, or instantaneous
failure rate.
(a) Give an expression for the failure rate in terms of $F(x)$, the
cumulative distribution function of $X$.
(b) Suppose that $X$ has the exponential distribution with the cumulative
distribution function

$$F(x) = 1 - e^{-x/\sigma}, \quad x \ge 0,$$

where $\sigma$ is a positive constant. Show that $X$ has a constant hazard
rate.
(c) Show that any random variable with a constant hazard rate must
have the exponential distribution.

4.30. Consider a Poisson process with parameter $\lambda$ over the interval $(0, t)$.
Divide this interval into $n$ equal subintervals of length $h = t/n$. We
consider that we have a "success" in a given subinterval if one arrival
occurs in that subinterval. If there are no arrivals, then we consider that
we have a "failure." Let $Y_n$ denote the number of "successes" in the $n$
subintervals of length $h$. Then we have approximately

$$P(Y_n = r) = \binom{n}{r} p_n^r (1 - p_n)^{n-r}, \quad r = 0, 1, \ldots, n,$$

where $p_n$ is approximately equal to $\lambda h = \lambda t/n$. Show that

$$\lim_{n \to \infty} P(Y_n = r) = \frac{e^{-\lambda t}(\lambda t)^r}{r!}.$$

CHAPTER 5 


Infinite Sequences and Series 


The study of the theory of infinite sequences and series is an integral part of 
advanced calculus. All limiting processes, such as differentiation and integra- 
tion, can be investigated on the basis of this theory. 

The first example of an infinite series is attributed to Archimedes, who
showed that the sum

$$1 + \frac{1}{4} + \cdots + \frac{1}{4^n}$$

was less than $\frac{4}{3}$ for any value of $n$. However, it was not until the nineteenth
century that the theory of infinite series was firmly established by Augustin-Louis
Cauchy (1789-1857).
In this chapter we shall study the theory of infinite sequences and series, 
and investigate their convergence. Unless otherwise stated, the terms of all 
sequences and series considered in this chapter are real-valued. 


5.1. INFINITE SEQUENCES 


In Chapter 1 we introduced the general concept of a function. An infinite
sequence is a particular function $f: J^+ \to R$ defined on the set of all positive
integers. For a given $n \in J^+$, the value of this function, namely $f(n)$, is called
the $n$th term of the infinite sequence and is denoted by $a_n$. The sequence
itself is denoted by the symbol $\{a_n\}_{n=1}^\infty$. In some cases, the integer with which
the infinite sequence begins is different from one. For example, it may be
equal to zero or to some other integer. For the sake of simplicity, an infinite
sequence will be referred to as just a sequence.

Since a sequence is a function, then, in particular, the sequence $\{a_n\}_{n=1}^\infty$
can have the following properties:

1. It is bounded if there exists a constant $K > 0$ such that $|a_n| \le K$ for
all $n$.

2. It is monotone increasing if $a_n \le a_{n+1}$ for all $n$, and is monotone
decreasing if $a_n \ge a_{n+1}$ for all $n$.

3. It converges to a finite number $c$ if $\lim_{n \to \infty} a_n = c$, that is, for a given
$\epsilon > 0$ there exists an integer $N$ such that

$$|a_n - c| < \epsilon \quad \text{if } n > N.$$

In this case, $c$ is called the limit of the sequence and this fact is
denoted by writing $a_n \to c$ as $n \to \infty$. If the sequence does not converge
to a finite limit, then it is said to be divergent.

4. It is said to oscillate if it does not converge to a finite limit, nor to $+\infty$
or $-\infty$ as $n \to \infty$.


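EXAMPLE 5.1.1. Let $a_n = (n^2 + 2n)/(2n^2 + 3)$. Then $a_n \to \frac{1}{2}$ as $n \to \infty$,
since

$$\lim_{n \to \infty} a_n = \lim_{n \to \infty} \frac{1 + 2/n}{2 + 3/n^2} = \frac{1}{2}.$$

EXAMPLE 5.1.2. Consider $a_n = \sqrt{n + 1} - \sqrt{n}$. This sequence converges to
zero, since

$$a_n = \frac{(\sqrt{n + 1} - \sqrt{n})(\sqrt{n + 1} + \sqrt{n})}{\sqrt{n + 1} + \sqrt{n}} = \frac{1}{\sqrt{n + 1} + \sqrt{n}}.$$

Hence, $a_n \to 0$ as $n \to \infty$.

EXAMPLE 5.1.3. Suppose that $a_n = 2^n/n^3$. Here, the sequence is divergent,
since by Example 4.2.3,

$$\lim_{n \to \infty} \frac{2^n}{n^3} = \infty.$$

EXAMPLE 5.1.4. Let $a_n = (-1)^n$. This sequence oscillates, since it is equal
to 1 when $n$ is even and to $-1$ when $n$ is odd.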


Theorem 5.1.1. Every convergent sequence is bounded. 


Proof. Suppose that $\{a_n\}_{n=1}^\infty$ converges to $c$. Then, there exists an integer
$N$ such that

$$|a_n - c| < 1 \quad \text{if } n > N.$$

For such values of $n$, we have

$$|a_n| < \max(|c - 1|, |c + 1|).$$

It follows that

$$|a_n| < K$$

for all $n$, where

$$K = \max(|a_1| + 1, |a_2| + 1, \ldots, |a_N| + 1, |c - 1|, |c + 1|). \qquad \Box$$

The converse of Theorem 5.1.1 is not necessarily true. That is, if a
sequence is bounded, then it does not have to be convergent. As a counterexample,
consider the sequence given in Example 5.1.4. This sequence is
bounded, but is not convergent. To guarantee convergence of a bounded
sequence we obviously need an additional condition.


Theorem 5.1.2. Every bounded monotone (increasing or decreasing) se- 
quence converges. 


Proof. Suppose that $\{a_n\}_{n=1}^\infty$ is a bounded and monotone increasing sequence
(the proof is similar if the sequence is monotone decreasing). Since
the sequence is bounded, it must be bounded from above and hence has a
least upper bound $c$ (see Theorem 1.5.1). Thus $a_n \le c$ for all $n$. Furthermore,
for any given $\epsilon > 0$ there exists an integer $N$ such that

$$c - \epsilon < a_N \le c;$$

otherwise $c - \epsilon$ would be an upper bound of $\{a_n\}_{n=1}^\infty$. Now, because the
sequence is monotone increasing,

$$c - \epsilon < a_N \le a_{N+1} \le a_{N+2} \le \cdots \le c,$$

that is,

$$c - \epsilon < a_n \le c \quad \text{for } n \ge N.$$

We can write

$$c - \epsilon < a_n < c + \epsilon,$$

or equivalently,

$$|a_n - c| < \epsilon \quad \text{if } n \ge N.$$

This indicates that $\{a_n\}_{n=1}^\infty$ converges to $c$. □

Using Theorem 5.1.2 it is easy to prove the following corollary.

Corollary 5.1.1.

1. If $\{a_n\}_{n=1}^\infty$ is bounded from above and is monotone increasing, then
$\{a_n\}_{n=1}^\infty$ converges to $c = \sup_{n \ge 1} a_n$.

2. If $\{a_n\}_{n=1}^\infty$ is bounded from below and is monotone decreasing, then
$\{a_n\}_{n=1}^\infty$ converges to $d = \inf_{n \ge 1} a_n$.


EXAMPLE 5.1.5. Consider the sequence $\{a_n\}_{n=1}^\infty$, where $a_1 = \sqrt{2}$ and
$a_{n+1} = \sqrt{2 + \sqrt{a_n}}$ for $n \ge 1$. This sequence is bounded, since $a_n < 2$ for all $n$, as
can be easily shown using mathematical induction: We have $a_1 = \sqrt{2} < 2$. If
$a_n < 2$, then $a_{n+1} < \sqrt{2 + \sqrt{2}} < 2$. Furthermore, the sequence is monotone
increasing, since $a_n \le a_{n+1}$ for $n = 1, 2, \ldots$, which can also be shown by
mathematical induction. Hence, by Theorem 5.1.2 $\{a_n\}_{n=1}^\infty$ must converge. To
find its limit, we note that

$$\lim_{n \to \infty} a_{n+1} = \lim_{n \to \infty} \sqrt{2 + \sqrt{a_n}} = \sqrt{2 + \sqrt{\lim_{n \to \infty} a_n}}.$$

If $c$ denotes the limit of $a_n$ as $n \to \infty$, then

$$c = \sqrt{2 + \sqrt{c}}.$$

By solving this equation under the condition $c \ge \sqrt{2}$ we find that the only
solution is $c = 1.831$.
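
The recursion of Example 5.1.5 is simple to iterate directly. The Python sketch below is an added illustration; the number of iterations is an arbitrary choice. It lists the first few terms and the value at which the iteration settles.

```python
import math

a = math.sqrt(2.0)           # a_1
terms = [a]
for _ in range(30):
    a = math.sqrt(2.0 + math.sqrt(a))   # a_{n+1} = sqrt(2 + sqrt(a_n))
    terms.append(a)

print([round(value, 4) for value in terms[:5]])
print(round(terms[-1], 3))
# Residual of the fixed-point equation c = sqrt(2 + sqrt(c)) at the last term.
print(abs(terms[-1] - math.sqrt(2.0 + math.sqrt(terms[-1]))))
```

The early terms increase and stay below 2, as the induction argument in the example asserts, and the iteration settles at a value that agrees with the limit quoted there.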


Definition 5.1.1. Consider the sequence $\{a_n\}_{n=1}^\infty$. An infinite collection of
its terms, picked out in a manner that preserves the original order of the
terms of the sequence, is called a subsequence of $\{a_n\}_{n=1}^\infty$. More formally, any
sequence of the form $\{b_n\}_{n=1}^\infty$, where $b_n = a_{k_n}$ such that $k_1 < k_2 < \cdots < k_n <
\cdots$, is a subsequence of $\{a_n\}_{n=1}^\infty$. Note that $k_n \ge n$ for $n \ge 1$. □

Theorem 5.1.3. A sequence $\{a_n\}_{n=1}^\infty$ converges to $c$ if and only if every
subsequence of $\{a_n\}_{n=1}^\infty$ converges to $c$.

Proof. The proof is left to the reader. □

It should be noted that if a sequence diverges, then it does not necessarily
follow that every one of its subsequences must diverge. A sequence may fail
to converge, yet several of its subsequences converge. For example, the
sequence whose $n$th term is $a_n = (-1)^n$ is divergent, as was seen earlier.
However, the two subsequences $\{b_n\}_{n=1}^\infty$ and $\{c_n\}_{n=1}^\infty$, where $b_n = a_{2n} = 1$ and
$c_n = a_{2n-1} = -1$ ($n = 1, 2, \ldots$), are both convergent.

We have noted earlier that a bounded sequence may not converge. It is
possible, however, that one of its subsequences is convergent. This is shown
in the next theorem.


Theorem 5.1.4. Every bounded sequence has a convergent subsequence. 


Proof. Suppose that $\{a_n\}_{n=1}^\infty$ is a bounded sequence. Without loss of
generality we can consider that the number of distinct terms of the sequence
is infinite. (If this is not the case, then there exists an infinite subsequence of
$\{a_n\}_{n=1}^\infty$ that consists of terms that are equal. Obviously, such a subsequence
converges.) Let $G$ denote the set consisting of all terms of the sequence.
Then $G$ is a bounded infinite set. By Theorem 1.6.2, $G$ must have a limit
point, say $c$. Also, by Theorem 1.6.1, every neighborhood of $c$ must contain
infinitely many points of $G$. It follows that we can find integers $k_1 < k_2 < k_3
< \cdots$ such that

$$|a_{k_n} - c| < \frac{1}{n} \quad \text{for } n = 1, 2, \ldots.$$

Thus for a given $\epsilon > 0$ there exists an integer $N > 1/\epsilon$ such that $|a_{k_n} - c| < \epsilon$
if $n > N$. This indicates that the subsequence $\{a_{k_n}\}_{n=1}^\infty$ converges to $c$. □

We conclude from Theorem 5.1.4 that a bounded sequence can have
several convergent subsequences. The limit of each of these subsequences is
called a subsequential limit. Let $E$ denote the set of all subsequential limits
of $\{a_n\}_{n=1}^\infty$. This set is bounded, since the sequence is bounded (why?).


Definition 5.1.2. Let $\{a_n\}_{n=1}^\infty$ be a bounded sequence, and let $E$ be the
set of all its subsequential limits. Then the least upper bound of $E$ is called
the upper limit of $\{a_n\}_{n=1}^\infty$ and is denoted by $\limsup_{n \to \infty} a_n$. Similarly, the
greatest lower bound of $E$ is called the lower limit of $\{a_n\}_{n=1}^\infty$ and is denoted
by $\liminf_{n \to \infty} a_n$. For example, the sequence $\{a_n\}_{n=1}^\infty$, where $a_n = (-1)^n[1 +
(1/n)]$, has two subsequential limits, namely $-1$ and 1. Thus $E = \{-1, 1\}$,
and $\limsup_{n \to \infty} a_n = 1$, $\liminf_{n \to \infty} a_n = -1$. □

Theorem 5.1.5. The sequence $\{a_n\}_{n=1}^\infty$ converges to $c$ if and only if

$$\liminf_{n \to \infty} a_n = \limsup_{n \to \infty} a_n = c.$$

Proof. The proof is left to the reader. □

Theorem 5.1.5 implies that when a sequence converges, the set of all its
subsequential limits consists of a single element, namely the limit of the
sequence.


5.1.1. The Cauchy Criterion 


We have seen earlier that the definition of convergence of a sequence $\{a_n\}_{n=1}^{\infty}$
requires finding the limit of $a_n$ as $n \to \infty$. In some cases, such a limit may be
difficult to figure out. For example, consider the sequence whose nth term is

$$a_n = 1 - \frac{1}{3} + \frac{1}{5} - \cdots + \frac{(-1)^{n-1}}{2n-1}, \quad n = 1, 2, \ldots. \tag{5.1}$$


It is not easy to calculate the limit of $a_n$ in order to find out if the sequence
converges. Fortunately, however, there is another convergence criterion for 
sequences, known as the Cauchy criterion after Augustin-Louis Cauchy (it was 
known earlier to Bernhard Bolzano, 1781-1848, a Czechoslovakian priest 
whose mathematical work was undeservedly overlooked by his lay and cleri- 
cal contemporaries; see Boyer, 1968, page 566). 


Theorem 5.1.6 (The Cauchy Criterion). The sequence $\{a_n\}_{n=1}^{\infty}$ converges
if and only if it satisfies the following condition, known as the ε-condition:
For each $\epsilon > 0$ there is an integer N such that

$$|a_m - a_n| < \epsilon \quad \text{for all } m > N,\ n > N.$$


Proof. Necessity: If the sequence converges, then it must satisfy the ε-condition.
Let $\epsilon > 0$ be given. Since the sequence $\{a_n\}_{n=1}^{\infty}$ converges, then there
exists a number c and an integer N such that

$$|a_n - c| < \frac{\epsilon}{2} \quad \text{if } n > N.$$

Hence, for $m > N$, $n > N$ we must have

$$|a_m - a_n| = |a_m - c + c - a_n| \le |a_m - c| + |a_n - c| < \epsilon.$$

Sufficiency: If the sequence satisfies the ε-condition, then it must converge.
If the ε-condition is satisfied, then for any given $\epsilon > 0$ there is an integer N
such that

$$|a_n - a_{N+1}| < \epsilon$$

for all values of $n \ge N + 1$. Thus for such values of n,

$$a_{N+1} - \epsilon < a_n < a_{N+1} + \epsilon. \tag{5.2}$$

The sequence $\{a_n\}_{n=1}^{\infty}$ is therefore bounded, since from the double inequality
(5.2) we can assert that

$$|a_n| < \max\left(|a_1| + 1, |a_2| + 1, \ldots, |a_N| + 1, |a_{N+1} - \epsilon|, |a_{N+1} + \epsilon|\right)$$

for all n. By Theorem 5.1.4, $\{a_n\}_{n=1}^{\infty}$ has a convergent subsequence $\{a_{k_n}\}_{n=1}^{\infty}$.
Let c be the limit of this subsequence. If we invoke again the ε-condition, we
can find an integer $N'$ such that

$$|a_m - a_{k_n}| < \epsilon' \quad \text{if } m > N',\ k_n \ge n \ge N',$$

where $\epsilon' < \epsilon$. By fixing m and letting $k_n \to \infty$ we get

$$|a_m - c| \le \epsilon' < \epsilon \quad \text{if } m > N'.$$

This indicates that the sequence $\{a_n\}_{n=1}^{\infty}$ is convergent and has c as its limit. □


Definition 5.1.3. A sequence $\{a_n\}_{n=1}^{\infty}$ that satisfies the ε-condition of the
Cauchy criterion is said to be a Cauchy sequence. □


EXAMPLE 5.1.6. With the help of the Cauchy criterion it is now possible
to show that the sequence $\{a_n\}_{n=1}^{\infty}$ whose nth term is defined by formula (5.1)
is a Cauchy sequence and is therefore convergent. To do so, let $m > n$. Then,

$$a_m - a_n = (-1)^n \left[ \frac{1}{2n+1} - \frac{1}{2n+3} + \cdots + \frac{(-1)^{p-1}}{2n+2p-1} \right], \tag{5.3}$$

where $p = m - n$. We claim that the quantity inside brackets in formula (5.3)
is positive. This can be shown by grouping successive terms in pairs. Thus if p
is even, the quantity is equal to

$$\left( \frac{1}{2n+1} - \frac{1}{2n+3} \right) + \left( \frac{1}{2n+5} - \frac{1}{2n+7} \right) + \cdots + \left( \frac{1}{2n+2p-3} - \frac{1}{2n+2p-1} \right),$$

which is positive, since the difference inside each parenthesis is positive. If
$p = 1$, the quantity is obviously positive, since it is then equal to $1/(2n + 1)$.
If $p \ge 3$ is an odd integer, the quantity can be written as

$$\left( \frac{1}{2n+1} - \frac{1}{2n+3} \right) + \left( \frac{1}{2n+5} - \frac{1}{2n+7} \right) + \cdots + \left( \frac{1}{2n+2p-5} - \frac{1}{2n+2p-3} \right) + \frac{1}{2n+2p-1},$$

which is also positive. Hence, for any p,

$$|a_m - a_n| = \frac{1}{2n+1} - \frac{1}{2n+3} + \cdots + \frac{(-1)^{p-1}}{2n+2p-1}.$$

We now claim that

$$|a_m - a_n| \le \frac{1}{2n+1},$$

with equality only when $p = 1$. To prove this claim, let us again consider two
cases. If p is even, then

$$|a_m - a_n| = \frac{1}{2n+1} - \left( \frac{1}{2n+3} - \frac{1}{2n+5} \right) - \cdots - \left( \frac{1}{2n+2p-5} - \frac{1}{2n+2p-3} \right) - \frac{1}{2n+2p-1} < \frac{1}{2n+1}, \tag{5.5}$$

since all the quantities inside parentheses in (5.5) are positive. If p is odd,
then

$$|a_m - a_n| = \frac{1}{2n+1} - \left( \frac{1}{2n+3} - \frac{1}{2n+5} \right) - \cdots - \left( \frac{1}{2n+2p-3} - \frac{1}{2n+2p-1} \right) \le \frac{1}{2n+1},$$

which proves our claim. On the basis of this result we can assert that for a
given $\epsilon > 0$,

$$|a_m - a_n| < \epsilon \quad \text{if } m > n > N,$$

where N is such that

$$\frac{1}{2N+1} < \epsilon,$$

or equivalently,

$$N > \frac{1}{2\epsilon} - \frac{1}{2}.$$

This shows that $\{a_n\}_{n=1}^{\infty}$ is a Cauchy sequence.
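The bound obtained in Example 5.1.6 lends itself to a quick numerical check. The following is a minimal sketch, not part of the original argument: the function names are hypothetical, the check covers only finitely many pairs (n, m), and the comparison with π/4 uses the well-known value of this series (Leibniz's formula), which the example itself does not state.

```python
import math


def a(n):
    """Partial sum a_n = 1 - 1/3 + 1/5 - ... + (-1)^(n-1)/(2n-1), formula (5.1)."""
    return sum((-1) ** (i - 1) / (2 * i - 1) for i in range(1, n + 1))


def smallest_N(eps):
    """Smallest positive integer N with N > 1/(2*eps) - 1/2, so 1/(2N+1) < eps."""
    return max(1, math.floor(1 / (2 * eps) - 0.5) + 1)


def bound_holds(n, m):
    """Check |a_m - a_n| <= 1/(2n+1) for m > n (tiny slack for rounding)."""
    return abs(a(m) - a(n)) <= 1 / (2 * n + 1) + 1e-12


if __name__ == "__main__":
    eps = 0.01
    N = smallest_N(eps)
    print("N =", N, "and 1/(2N+1) =", 1 / (2 * N + 1))
    print(all(bound_holds(n, m)
              for n in range(N + 1, N + 20)
              for m in range(n + 1, n + 50)))
    print("a_20000 =", a(20000), "pi/4 =", math.pi / 4)
```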
EXAMPLE 5.1.7. Consider the sequence $\{a_n\}_{n=1}^{\infty}$, where

$$a_n = (-1)^n \left[ 1 + \frac{1}{n} \right].$$

We have seen earlier that $\liminf_{n\to\infty} a_n = -1$ and $\limsup_{n\to\infty} a_n = 1$. Thus
by Theorem 5.1.5 this sequence is not convergent. We can arrive at the same
conclusion using the Cauchy criterion by showing that the ε-condition is not
satisfied. This occurs whenever we can find an $\epsilon > 0$ such that, however N
may be chosen,

$$|a_m - a_n| \ge \epsilon$$

for some $m > N$, $n > N$. In our example, if N is any positive integer, then the
inequality

$$|a_m - a_n| \ge 2 \tag{5.6}$$

can be satisfied by choosing $m = \nu$ and $n = \nu + 1$, where $\nu$ is an odd integer
greater than N.


5.2. INFINITE SERIES 


Let $\{a_n\}_{n=1}^{\infty}$ be a given sequence. Consider the symbolic expression

$$\sum_{n=1}^{\infty} a_n = a_1 + a_2 + \cdots + a_n + \cdots. \tag{5.7}$$

By definition, this expression is called an infinite series, or just a series for
simplicity, and $a_n$ is referred to as the nth term of the series. The finite sum

$$s_n = \sum_{i=1}^{n} a_i, \quad n = 1, 2, \ldots,$$

is called the nth partial sum of the series.


Definition 5.2.1. Consider the series $\sum_{n=1}^{\infty} a_n$. Let $s_n$ be its nth partial
sum (n = 1, 2, ...).

1. The series is said to be convergent if the sequence $\{s_n\}_{n=1}^{\infty}$ converges. In
this case, if $\lim_{n\to\infty} s_n = s$, where s is finite, then we say that the series
converges to s, or that s is the sum of the series. Symbolically, this is
expressed by writing

$$s = \sum_{n=1}^{\infty} a_n.$$

2. If $s_n$ does not tend to a finite limit, then the series is said to be
divergent. □


Definition 5.2.1 formulates convergence of a series in terms of conver- 
gence of the associated sequence of its partial sums. By applying the Cauchy 
criterion (Theorem 5.1.6) to the latter sequence, we arrive at the following 
condition of convergence for a series: 


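Theorem 5.2.1. The series $\sum_{n=1}^{\infty} a_n$ converges if and only if for a given
$\epsilon > 0$ there is an integer N such that

$$\left| \sum_{i=m+1}^{n} a_i \right| < \epsilon \quad \text{for all } n > m > N. \tag{5.8}$$

Inequality (5.8) follows from applying Theorem 5.1.6 to the sequence
$\{s_n\}_{n=1}^{\infty}$ of partial sums of the series and noting that

$$|s_n - s_m| = \left| \sum_{i=m+1}^{n} a_i \right| \quad \text{for } n > m.$$

In particular, if $n = m + 1$, then inequality (5.8) becomes

$$|a_{m+1}| < \epsilon \tag{5.9}$$

for all $m > N$. This implies that $\lim_{m\to\infty} a_{m+1} = 0$, and hence $\lim_{n\to\infty} a_n = 0$.
We therefore conclude the following result: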


RESULT 5.2.1. If $\sum_{n=1}^{\infty} a_n$ is a convergent series, then $\lim_{n\to\infty} a_n = 0$.

It is important here to note that the convergence of the nth term of a
series to zero as $n \to \infty$ is a necessary condition for the convergence of the
series. It is not, however, a sufficient condition, that is, if $\lim_{n\to\infty} a_n = 0$, then
it does not follow that $\sum_{n=1}^{\infty} a_n$ converges. For example, as we shall see later,
the series $\sum_{n=1}^{\infty} (1/n)$ is divergent, and its nth term goes to zero as $n \to \infty$. It
is true, however, that if $\lim_{n\to\infty} a_n \ne 0$, then $\sum_{n=1}^{\infty} a_n$ is divergent. This follows
from applying the law of contraposition to the necessary condition of
convergence. We conclude the following:

1. If $a_n \to 0$ as $n \to \infty$, then no conclusion can be reached regarding
convergence or divergence of $\sum_{n=1}^{\infty} a_n$.
2. If $a_n \not\to 0$ as $n \to \infty$, then $\sum_{n=1}^{\infty} a_n$ is divergent. For example, the series
$\sum_{n=1}^{\infty} [n/(n + 1)]$ is divergent, since

$$\lim_{n\to\infty} \frac{n}{n+1} = 1 \ne 0.$$

EXAMPLE 5.2.1. One of the simplest series is the geometric series, $\sum_{n=1}^{\infty} a^n$.
This series is divergent if $|a| \ge 1$, since $\lim_{n\to\infty} a^n \ne 0$. It is convergent if
$|a| < 1$ by the Cauchy criterion: Let $n > m$. Then

$$s_n - s_m = a^{m+1} + a^{m+2} + \cdots + a^n. \tag{5.10}$$

By multiplying the two sides of (5.10) by a, we get

$$a(s_n - s_m) = a^{m+2} + a^{m+3} + \cdots + a^{n+1}. \tag{5.11}$$

If we now subtract (5.11) from (5.10), we obtain

$$s_n - s_m = \frac{a^{m+1} - a^{n+1}}{1 - a}. \tag{5.12}$$

Since $|a| < 1$, we can find an integer N such that for $m > N$, $n > N$,

$$|a|^{m+1} < \frac{\epsilon(1 - a)}{2}, \qquad |a|^{n+1} < \frac{\epsilon(1 - a)}{2}.$$

Hence, for a given $\epsilon > 0$,

$$|s_n - s_m| < \epsilon \quad \text{if } n > m > N.$$

Formula (5.12) can actually be used to find the sum of the geometric series
when $|a| < 1$. Let $m = 1$. By taking the limits of both sides of (5.12) as $n \to \infty$
we get

$$\lim_{n\to\infty} s_n = s_1 + \frac{a^2}{1 - a}, \quad \text{since } \lim_{n\to\infty} a^{n+1} = 0,$$

$$= a + \frac{a^2}{1 - a} = \frac{a}{1 - a}.$$
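Formula (5.12) and the resulting sum a/(1 − a) are easy to verify numerically. The sketch below is illustrative only; the function names are hypothetical, and floating-point arithmetic means that agreement is only up to rounding error.

```python
def partial_sum(a, n):
    """s_n = a + a^2 + ... + a^n."""
    return sum(a ** i for i in range(1, n + 1))


def difference_closed_form(a, m, n):
    """Right-hand side of (5.12): (a^(m+1) - a^(n+1)) / (1 - a), for a != 1."""
    return (a ** (m + 1) - a ** (n + 1)) / (1 - a)


def limit_sum(a):
    """Sum a / (1 - a) of the geometric series, valid only for |a| < 1."""
    if abs(a) >= 1:
        raise ValueError("the geometric series diverges for |a| >= 1")
    return a / (1 - a)


if __name__ == "__main__":
    a, m, n = -0.7, 5, 40
    print(partial_sum(a, n) - partial_sum(a, m), difference_closed_form(a, m, n))
    print(partial_sum(a, 200), limit_sum(a))
```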

EXAMPLE 5.2.2. Consider the series $\sum_{n=1}^{\infty} (1/n!)$. This series converges by
the Cauchy criterion. To show this, we first note that

$$n! = n(n-1)(n-2) \times \cdots \times 3 \times 2 \times 1 \ge 2^{n-1} \quad \text{for } n = 1, 2, \ldots.$$

Hence, for $n > m$,

$$|s_n - s_m| = \frac{1}{(m+1)!} + \frac{1}{(m+2)!} + \cdots + \frac{1}{n!} \le \frac{1}{2^m} + \frac{1}{2^{m+1}} + \cdots + \frac{1}{2^{n-1}} = 2 \sum_{i=m+1}^{n} \frac{1}{2^i}.$$

This is a partial sum of a convergent geometric series with $a = \frac{1}{2} < 1$ [see
formula (5.10)]. Consequently, $|s_n - s_m|$ can be made smaller than any given
$\epsilon > 0$ by choosing m and n large enough.


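Theorem 5.2.2. If $\sum_{n=1}^{\infty} a_n$ and $\sum_{n=1}^{\infty} b_n$ are two convergent series, and if c
is a constant, then the following series are also convergent:

1. $\sum_{n=1}^{\infty} (c a_n) = c \sum_{n=1}^{\infty} a_n$.
2. $\sum_{n=1}^{\infty} (a_n + b_n) = \sum_{n=1}^{\infty} a_n + \sum_{n=1}^{\infty} b_n$.


Proof. The proof is left to the reader. □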


Definition 5.2.2. The series $\sum_{n=1}^{\infty} a_n$ is absolutely convergent if $\sum_{n=1}^{\infty} |a_n|$
is convergent. □


For example, the series $\sum_{n=1}^{\infty} [(-1)^n/n!]$ is absolutely convergent, since
$\sum_{n=1}^{\infty} (1/n!)$ is convergent, as was seen in Example 5.2.2.


Theorem 5.2.3. Every absolutely convergent series is convergent. 


Proof. Consider the series $\sum_{n=1}^{\infty} a_n$, and suppose that $\sum_{n=1}^{\infty} |a_n|$ is
convergent. We have that

$$\left| \sum_{i=m+1}^{n} a_i \right| \le \sum_{i=m+1}^{n} |a_i|. \tag{5.13}$$

By applying the Cauchy criterion to $\sum_{n=1}^{\infty} |a_n|$ we can find an integer N such
that for a given $\epsilon > 0$,

$$\sum_{i=m+1}^{n} |a_i| < \epsilon \quad \text{if } n > m > N. \tag{5.14}$$

From (5.13) and (5.14) we conclude that $\sum_{n=1}^{\infty} a_n$ satisfies the Cauchy
criterion and is therefore convergent by Theorem 5.2.1. □


Note that it is possible that $\sum_{n=1}^{\infty} a_n$ is convergent while $\sum_{n=1}^{\infty} |a_n|$ is
divergent. In this case, the series $\sum_{n=1}^{\infty} a_n$ is said to be conditionally
convergent. Examples of this kind of series will be seen later.

In the next section we shall discuss convergence of series whose terms are 
positive. 


5.2.1. Tests of Convergence for Series of Positive Terms 


Suppose that the terms of the series $\sum_{n=1}^{\infty} a_n$ are such that $a_n > 0$ for $n > K$,
where K is a constant. Without loss of generality we shall consider that
K = 1. Such a series is called a series of positive terms.

Series of positive terms are interesting because the study of their
convergence is comparatively simple and can be used in the determination of
convergence of more general series whose terms are not necessarily positive.
It is easy to see that a series of positive terms diverges if and only if its sum
is $+\infty$.

In what follows we shall introduce techniques that simplify the process of
determining whether or not a given series of positive terms is convergent. We
refer to these techniques as tests of convergence. The advantage of these
tests is that they are in general easier to apply than the Cauchy criterion.
This is because evaluating or obtaining inequalities involving the expression
$\sum_{i=m+1}^{n} a_i$ in Theorem 5.2.1 can be somewhat difficult. The tests of
convergence, however, have the disadvantage that they can sometimes fail to
determine convergence or divergence, as we shall soon find out. It should be
remembered that these tests apply only to series of positive terms.


The Comparison Test 


This test is based on the following theorem: 


Theorem 5.2.4. Let $\sum_{n=1}^{\infty} a_n$ and $\sum_{n=1}^{\infty} b_n$ be two series of positive terms
such that $a_n \le b_n$ for $n > N_0$, where $N_0$ is a fixed integer.

i. If $\sum_{n=1}^{\infty} b_n$ converges, then so does $\sum_{n=1}^{\infty} a_n$.
ii. If $\sum_{n=1}^{\infty} a_n$ is divergent, then $\sum_{n=1}^{\infty} b_n$ is divergent too.


Proof. We have that

$$\sum_{i=m+1}^{n} a_i \le \sum_{i=m+1}^{n} b_i \quad \text{for } n > m > N_0. \tag{5.15}$$

If $\sum_{n=1}^{\infty} b_n$ is convergent, then for a given $\epsilon > 0$ there exists an integer $N_1$ such
that

$$\sum_{i=m+1}^{n} b_i < \epsilon \quad \text{for } n > m > N_1. \tag{5.16}$$

From (5.15) and (5.16) it follows that if $n > m > N$, where $N = \max(N_0, N_1)$,
then

$$\sum_{i=m+1}^{n} a_i < \epsilon,$$

which proves (i).
The proof of (ii) follows from applying the law of contraposition to (i). □


To determine convergence or divergence of $\sum_{n=1}^{\infty} a_n$, we thus need to have
in our repertoire a collection of series of positive terms whose behavior
(with regard to convergence or divergence) is known. These series can then
be compared against $\sum_{n=1}^{\infty} a_n$. For this purpose, the following series can be
useful:

a. $\sum_{n=1}^{\infty} 1/n$. This is a divergent series called the harmonic series.
b. $\sum_{n=1}^{\infty} 1/n^k$. This is divergent if $k \le 1$ and is convergent if $k > 1$.


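To prove that the harmonic series is divergent, let us consider its nth
partial sum, namely,

$$s_n = \sum_{i=1}^{n} \frac{1}{i}.$$

Let $A > 0$ be an arbitrary positive number. Choose n large enough so that
$n > 2^m$, where $m > 2A$. Then for such values of n,

$$s_n > 1 + \frac{1}{2} + \left( \frac{1}{3} + \frac{1}{4} \right) + \left( \frac{1}{5} + \frac{1}{6} + \frac{1}{7} + \frac{1}{8} \right) + \cdots + \left( \frac{1}{2^{m-1}+1} + \cdots + \frac{1}{2^m} \right)$$

$$> \frac{1}{2} + \frac{2}{4} + \frac{4}{8} + \cdots + \frac{2^{m-1}}{2^m} = \frac{m}{2} > A. \tag{5.17}$$

Since A is arbitrary and $s_n$ is a monotone increasing function of n, inequality
(5.17) implies that $s_n \to \infty$ as $n \to \infty$. This proves divergence of the harmonic
series.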
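The grouping argument for the harmonic series can be illustrated numerically. The sketch below is only a finite check under stated assumptions: the function names are hypothetical, and `n_from_proof` picks one admissible index n = 2^m + 1 with m the smallest integer exceeding 2A.

```python
import math


def harmonic(n):
    """n-th partial sum s_n = 1 + 1/2 + ... + 1/n of the harmonic series."""
    return sum(1 / i for i in range(1, n + 1))


def n_from_proof(A):
    """An index n = 2^m + 1 with m > 2A, as in the grouping argument."""
    m = math.floor(2 * A) + 1
    return 2 ** m + 1


if __name__ == "__main__":
    for A in (1, 2, 3, 5):
        n = n_from_proof(A)
        print(f"A = {A}: n = {n}, s_n = {harmonic(n):.4f}")
```

The partial sums exceed any prescribed A, but only for n growing exponentially in A, which is why the divergence is hard to see from a table of the first few partial sums.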

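Let us next consider the series in (b). If $k \le 1$, then $1/n^k \ge 1/n$ and
$\sum_{n=1}^{\infty} (1/n^k)$ must be divergent by Theorem 5.2.4(ii). Suppose now that $k > 1$.
Consider the nth partial sum of the series, namely,

$$s_n' = \sum_{i=1}^{n} \frac{1}{i^k}.$$

If m is an integer such that $n \le 2^m - 1$, then

$$
\begin{aligned}
s_n' &\le \sum_{i=1}^{2^m-1} \frac{1}{i^k} \\
&= 1 + \left( \frac{1}{2^k} + \frac{1}{3^k} \right) + \left( \frac{1}{4^k} + \frac{1}{5^k} + \frac{1}{6^k} + \frac{1}{7^k} \right) + \cdots + \left[ \frac{1}{(2^{m-1})^k} + \cdots + \frac{1}{(2^m-1)^k} \right] \\
&\le 1 + \left( \frac{1}{2^k} + \frac{1}{2^k} \right) + \left( \frac{1}{4^k} + \frac{1}{4^k} + \frac{1}{4^k} + \frac{1}{4^k} \right) + \cdots + \left[ \frac{1}{(2^{m-1})^k} + \cdots + \frac{1}{(2^{m-1})^k} \right] \\
&= 1 + \frac{2}{2^k} + \frac{4}{4^k} + \cdots + \frac{2^{m-1}}{(2^{m-1})^k} \\
&= \sum_{i=1}^{m} a^{i-1},
\end{aligned}
\tag{5.18}
$$

where $a = 1/2^{k-1}$. But the right-hand side of (5.18) represents the mth
partial sum of a convergent geometric series (since $a < 1$). Hence, as $m \to \infty$,
the right-hand side of (5.18) converges to

$$\sum_{i=1}^{\infty} a^{i-1} = \frac{1}{1 - a} \quad \text{(see Example 5.2.1)}.$$

Thus the sequence $\{s_n'\}_{n=1}^{\infty}$ is bounded. Since it is also monotone increasing, it
must be convergent (see Theorem 5.1.2). This proves convergence of the
series $\sum_{n=1}^{\infty} (1/n^k)$ for $k > 1$.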

Another version of the comparison test in Theorem 5.2.4 that is easier to
implement is given by the following theorem:


Theorem 5.2.5. Let $\sum_{n=1}^{\infty} a_n$ and $\sum_{n=1}^{\infty} b_n$ be two series of positive terms. If
there exists a positive constant $l$ such that $a_n$ and $l b_n$ are asymptotically
equal, $a_n \sim l b_n$ as $n \to \infty$ (see Section 3.3), that is,

$$\lim_{n\to\infty} \frac{a_n}{b_n} = l,$$

then the two series are either both convergent or both divergent.


Proof. There exists an integer N such that

$$\left| \frac{a_n}{b_n} - l \right| < \frac{l}{2} \quad \text{if } n > N,$$

or equivalently,

$$\frac{l}{2} < \frac{a_n}{b_n} < \frac{3l}{2} \quad \text{whenever } n > N.$$

If $\sum_{n=1}^{\infty} a_n$ is convergent, then $\sum_{n=1}^{\infty} b_n$ is convergent by a combination of
Theorems 5.2.2(1) and 5.2.4(i), since $b_n < (2/l)a_n$. Similarly, if $\sum_{n=1}^{\infty} b_n$
converges, then so does $\sum_{n=1}^{\infty} a_n$, since $a_n < (3l/2)b_n$. If $\sum_{n=1}^{\infty} a_n$ is divergent,
then $\sum_{n=1}^{\infty} b_n$ is divergent too by a combination of Theorems 5.2.2(1) and
5.2.4(ii), since $b_n > [2/(3l)]a_n$. Finally, $\sum_{n=1}^{\infty} a_n$ diverges if the same is true of
$\sum_{n=1}^{\infty} b_n$, since $a_n > (l/2)b_n$. □


EXAMPLE 5.2.3. The series $\sum_{n=1}^{\infty} (n + 2)/(n^3 + 2n + 1)$ is convergent, since

$$\frac{n+2}{n^3+2n+1} \sim \frac{1}{n^2} \quad \text{as } n \to \infty,$$

which is the nth term of a convergent series [recall that $\sum_{n=1}^{\infty} (1/n^k)$ is
convergent if $k > 1$].


EXAMPLE 5.2.4. $\sum_{n=1}^{\infty} 1/\sqrt{n(n + 1)}$ is divergent, because

$$\frac{1}{\sqrt{n(n+1)}} \sim \frac{1}{n} \quad \text{as } n \to \infty,$$

which is the nth term of the divergent harmonic series.
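The asymptotic comparison used in Example 5.2.4 can be inspected numerically. The sketch below is an illustration with hypothetical function names; it prints the ratio $a_n/b_n$ for $a_n = 1/\sqrt{n(n+1)}$ and $b_n = 1/n$, which should approach $l = 1$, and it confirms on a finite range the two-sided bound $l/2 < a_n/b_n < 3l/2$ used in the proof of Theorem 5.2.5.

```python
import math


def a(n):
    """Term of the series in Example 5.2.4."""
    return 1 / math.sqrt(n * (n + 1))


def b(n):
    """Term of the harmonic series."""
    return 1 / n


def ratio(n):
    """a_n / b_n, which tends to l = 1 as n grows."""
    return a(n) / b(n)


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000, 1_000_000):
        print(n, ratio(n))
    # Two-sided bound from the proof of Theorem 5.2.5 with l = 1.
    print(all(0.5 < ratio(n) < 1.5 for n in range(1, 1000)))
```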
The Ratio or d’Alembert’s Test


This test is usually attributed to the French mathematician Jean Baptiste 
d’Alembert (1717-1783), but is also known as Cauchy’s ratio test after 
Augustin-Louis Cauchy (1789-1857). 


Theorem 5.2.6. Let $\sum_{n=1}^{\infty} a_n$ be a series of positive terms. Then the
following hold:

1. The series converges if $\limsup_{n\to\infty} (a_{n+1}/a_n) < 1$ (see Definition 5.1.2).
2. The series diverges if $\liminf_{n\to\infty} (a_{n+1}/a_n) > 1$ (see Definition 5.1.2).
3. If $\liminf_{n\to\infty} (a_{n+1}/a_n) \le 1 \le \limsup_{n\to\infty} (a_{n+1}/a_n)$, no conclusion can
be made regarding convergence or divergence of the series (that is, the
ratio test fails).

In particular, if $\lim_{n\to\infty} (a_{n+1}/a_n) = r$ exists, then the following hold:

1. The series converges if $r < 1$.
2. The series diverges if $r > 1$.
3. The test fails if $r = 1$.


Proof. Let $p = \liminf_{n\to\infty} (a_{n+1}/a_n)$, $q = \limsup_{n\to\infty} (a_{n+1}/a_n)$.

1. If $q < 1$, then by the definition of the upper limit (Definition 5.1.2),
there exists an integer N such that

$$\frac{a_{n+1}}{a_n} < q' \quad \text{for } n \ge N, \tag{5.19}$$

where $q'$ is chosen such that $q < q' < 1$. (If $a_{n+1}/a_n \ge q'$ for infinitely
many values of n, then the sequence $\{a_{n+1}/a_n\}_{n=1}^{\infty}$ has a subsequential
limit greater than or equal to $q'$, which exceeds q. This contradicts the
definition of q.) From (5.19) we then get

$$a_{N+1} < a_N q',$$
$$a_{N+2} < a_{N+1} q' < a_N q'^2,$$
$$\vdots$$
$$a_{N+m} < a_{N+m-1} q' < a_N q'^m,$$

where $m \ge 1$. Thus for $n > N$,

$$a_n < a_N q'^{(n-N)} = a_N (q')^{-N} q'^n.$$

Hence, the series converges by comparison with the convergent geometric
series $\sum_{n=1}^{\infty} q'^n$, since $q' < 1$.

2. If $p > 1$, then in an analogous manner we can find an integer N such
that

$$\frac{a_{n+1}}{a_n} > p' \quad \text{for } n \ge N, \tag{5.20}$$

where $p'$ is chosen such that $p > p' > 1$. But this implies that $a_n$ cannot
tend to zero as $n \to \infty$, and the series is therefore divergent by Result
5.2.1.

3. If $p \le 1 \le q$, then we can demonstrate by using an example that the
ratio test is inconclusive: Consider the two series $\sum_{n=1}^{\infty} (1/n)$,
$\sum_{n=1}^{\infty} (1/n^2)$. For both series, $p = q = 1$ and hence $p \le 1 \le q$, since
$\lim_{n\to\infty} (a_{n+1}/a_n) = 1$. But the first series is divergent while the second
is convergent, as was seen earlier. □


EXAMPLE 5.2.5. Consider the same series as in Example 5.2.2. This series
was shown to be convergent by the Cauchy criterion. Let us now apply the
ratio test. In this case,

$$\lim_{n\to\infty} \frac{a_{n+1}}{a_n} = \lim_{n\to\infty} \left[ \frac{1}{(n+1)!} \Big/ \frac{1}{n!} \right] = \lim_{n\to\infty} \frac{1}{n+1} = 0 < 1,$$

which indicates convergence by Theorem 5.2.6(1).


Nurcombe (1979) stated and proved the following extension of the ratio
test:


Theorem 5.2.7. Let $\sum_{n=1}^{\infty} a_n$ be a series of positive terms, and k be a fixed
positive integer.

1. If $\lim_{n\to\infty} (a_{n+k}/a_n) < 1$, then the series converges.
2. If $\lim_{n\to\infty} (a_{n+k}/a_n) > 1$, then the series diverges.

This test reduces to the ratio test when $k = 1$.


The Root or Cauchy’s Test 


This is a more powerful test than the ratio test. It is based on the following 
theorem: 

Theorem 5.2.8. Let $\sum_{n=1}^{\infty} a_n$ be a series of positive terms. Let
$\limsup_{n\to\infty} a_n^{1/n} = \rho$. Then we have the following:

1. The series converges if $\rho < 1$.
2. The series diverges if $\rho > 1$.
3. The test is inconclusive if $\rho = 1$.

In particular, if $\lim_{n\to\infty} a_n^{1/n} = \tau$ exists, then we have the following:

1. The series converges if $\tau < 1$.
2. The series diverges if $\tau > 1$.
3. The test is inconclusive if $\tau = 1$.


Proof.
1. As in Theorem 5.2.6(1), if $\rho < 1$, then there is an integer N such that

$$a_n^{1/n} < \rho' \quad \text{for } n \ge N,$$

where $\rho'$ is chosen such that $\rho < \rho' < 1$. Thus

$$a_n < \rho'^n \quad \text{for } n \ge N.$$

The series is therefore convergent by comparison with the convergent
geometric series $\sum_{n=1}^{\infty} \rho'^n$, since $\rho' < 1$.

2. Suppose that $\rho > 1$. Let $\epsilon > 0$ be such that $\epsilon < \rho - 1$. Then

$$a_n^{1/n} > \rho - \epsilon > 1$$

for infinitely many values of n (why?). Thus for such values of n,

$$a_n > (\rho - \epsilon)^n,$$

which implies that $a_n$ cannot tend to zero as $n \to \infty$ and the series is
therefore divergent by Result 5.2.1.

3. Consider again the two series $\sum_{n=1}^{\infty} (1/n)$, $\sum_{n=1}^{\infty} (1/n^2)$. In both cases
$\rho = 1$ (see Exercise 5.18). The test therefore fails, since the first series is
divergent and the second is convergent. □


NOTE 5.2.1. We have mentioned earlier that the root test is more
powerful than the ratio test. By this we mean that whenever the ratio test
shows convergence or divergence, then so does the root test; whenever the
root test is inconclusive, the ratio test is inconclusive too. However, there are
situations where the ratio test fails, but the root test does not (see Example
5.2.6). This fact is based on the following theorem:

Theorem 5.2.9. If $a_n > 0$, then

$$\liminf_{n\to\infty} \frac{a_{n+1}}{a_n} \le \liminf_{n\to\infty} a_n^{1/n} \le \limsup_{n\to\infty} a_n^{1/n} \le \limsup_{n\to\infty} \frac{a_{n+1}}{a_n}.$$


Proof. It is sufficient to prove the two inequalities

$$\limsup_{n\to\infty} a_n^{1/n} \le \limsup_{n\to\infty} \frac{a_{n+1}}{a_n}, \tag{5.21}$$

$$\liminf_{n\to\infty} \frac{a_{n+1}}{a_n} \le \liminf_{n\to\infty} a_n^{1/n}. \tag{5.22}$$

Inequality (5.21): Let $q = \limsup_{n\to\infty} (a_{n+1}/a_n)$. If $q = \infty$, then there is
nothing to prove. Let us therefore consider that q is finite. If we choose $q'$ such
that $q < q'$, then as in the proof of Theorem 5.2.6(1), we can find an integer
N such that

$$a_n < a_N (q')^{-N} q'^n \quad \text{for } n > N.$$

Hence,

$$a_n^{1/n} < \left[ a_N (q')^{-N} \right]^{1/n} q'. \tag{5.23}$$

As $n \to \infty$, the limit of the right-hand side of inequality (5.23) is $q'$. It follows
that

$$\limsup_{n\to\infty} a_n^{1/n} \le q'. \tag{5.24}$$

Since (5.24) is true for any $q' > q$, then we must also have

$$\limsup_{n\to\infty} a_n^{1/n} \le q.$$

Inequality (5.22): Let $p = \liminf_{n\to\infty} (a_{n+1}/a_n)$. We can consider p to be
finite (if $p = \infty$, then $q = \infty$ and the proof of the theorem will be complete; if
$p = -\infty$, then there is nothing to prove). Let $p'$ be chosen such that $p' < p$.
As in the proof of Theorem 5.2.6(2), we can find an integer N such that

$$\frac{a_{n+1}}{a_n} > p' \quad \text{for } n \ge N. \tag{5.25}$$

From (5.25) it is easy to show that

$$a_n > a_N (p')^{-N} p'^n \quad \text{for } n > N.$$

Hence, for such values of n,

$$a_n^{1/n} > \left[ a_N (p')^{-N} \right]^{1/n} p'.$$

Consequently,

$$\liminf_{n\to\infty} a_n^{1/n} \ge p'. \tag{5.26}$$

Since (5.26) is true for any $p' < p$, then

$$\liminf_{n\to\infty} a_n^{1/n} \ge p. \qquad \square$$


From Theorem 5.2.9 we can easily see that whenever $q < 1$, then
$\limsup_{n\to\infty} a_n^{1/n} < 1$; whenever $p > 1$, then $\limsup_{n\to\infty} a_n^{1/n} > 1$. In both cases,
if convergence or divergence of the series is resolved by the ratio test, then it
can also be resolved by the root test. If, however, the root test fails (when
$\limsup_{n\to\infty} a_n^{1/n} = 1$), then the ratio test fails too by Theorem 5.2.6(3). On the
other hand, it is possible for the ratio test to be inconclusive whereas the root
test is not. This occurs when

$$\liminf_{n\to\infty} \frac{a_{n+1}}{a_n} \le \liminf_{n\to\infty} a_n^{1/n} \le \limsup_{n\to\infty} a_n^{1/n} < 1 < \limsup_{n\to\infty} \frac{a_{n+1}}{a_n}. \qquad \square$$


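EXAMPLE 5.2.6. Consider the series $\sum_{n=1}^{\infty} (a^n + b^n)$, where $0 < a < b < 1$.
This can be written as $\sum_{n=1}^{\infty} c_n$, where for $n \ge 1$,

$$c_n = \begin{cases} a^{(n+1)/2} & \text{if } n \text{ is odd}, \\ b^{n/2} & \text{if } n \text{ is even}. \end{cases}$$

Now,

$$\frac{c_{n+1}}{c_n} = \begin{cases} (b/a)^{(n+1)/2} & \text{if } n \text{ is odd}, \\ a(a/b)^{n/2} & \text{if } n \text{ is even}, \end{cases}$$

$$c_n^{1/n} = \begin{cases} a^{(n+1)/(2n)} & \text{if } n \text{ is odd}, \\ b^{1/2} & \text{if } n \text{ is even}. \end{cases}$$

As $n \to \infty$, $c_{n+1}/c_n$ has two limits, namely 0 and $\infty$; $c_n^{1/n}$ has two limits, $a^{1/2}$
and $b^{1/2}$. Thus

$$\liminf_{n\to\infty} \frac{c_{n+1}}{c_n} = 0,$$

$$\limsup_{n\to\infty} \frac{c_{n+1}}{c_n} = \infty,$$

$$\limsup_{n\to\infty} c_n^{1/n} = b^{1/2} < 1.$$

Since $0 < 1 < \infty$, we can clearly see that the ratio test is inconclusive, whereas
the root test indicates that the series is convergent.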
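The behavior described in Example 5.2.6 can be seen by tabulating the ratios and the nth roots for particular values of a and b. The sketch below is illustrative only: the values a = 0.3 and b = 0.5 and the function names are assumptions chosen for the demonstration, and a finite table can only suggest, not prove, the limiting values.

```python
def c(n, a, b):
    """c_n = a^((n+1)/2) for odd n and b^(n/2) for even n."""
    return a ** ((n + 1) // 2) if n % 2 == 1 else b ** (n // 2)


def ratio(n, a, b):
    """c_(n+1) / c_n, the quantity examined by the ratio test."""
    return c(n + 1, a, b) / c(n, a, b)


def root(n, a, b):
    """c_n^(1/n), the quantity examined by the root test."""
    return c(n, a, b) ** (1 / n)


if __name__ == "__main__":
    a, b = 0.3, 0.5
    for n in (9, 10, 49, 50, 199, 200):
        print(f"n = {n:3d}: ratio = {ratio(n, a, b):.3e}, root = {root(n, a, b):.4f}")
    print("b^(1/2) =", b ** 0.5)
```

For odd n the ratios grow without bound and for even n they shrink to zero, while the nth roots stay below $b^{1/2} < 1$ in the limit, in agreement with the conclusion of the example.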


Maclaurin’s (or Cauchy’s) Integral Test


This test was introduced by Colin Maclaurin (1698-1746) and then rediscov- 
ered by Cauchy. The description and proof of this test will be given in 
Chapter 6. 


Cauchy’s Condensation Test 


Let us consider the following theorem: 


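Theorem 5.2.10. Let $\sum_{n=1}^{\infty} a_n$ be a series of positive terms, where $a_n$ is a
monotone decreasing function of n (= 1, 2, ...). Then $\sum_{n=1}^{\infty} a_n$ converges or
diverges if and only if the same is true of the series $\sum_{n=1}^{\infty} 2^n a_{2^n}$.


Proof. Let $s_n$ and $t_m$ be the nth and mth partial sums, respectively, of
$\sum_{n=1}^{\infty} a_n$ and $\sum_{n=1}^{\infty} 2^n a_{2^n}$. If m is such that $n < 2^m$, then

$$
\begin{aligned}
s_n &\le a_1 + (a_2 + a_3) + (a_4 + a_5 + a_6 + a_7) + \cdots + (a_{2^m} + a_{2^m+1} + \cdots + a_{2^{m+1}-1}) \\
&\le a_1 + 2a_2 + 4a_4 + \cdots + 2^m a_{2^m} = a_1 + t_m.
\end{aligned}
\tag{5.27}
$$

Furthermore, if $n > 2^m$, then

$$
\begin{aligned}
s_n &> a_1 + a_2 + (a_3 + a_4) + \cdots + (a_{2^{m-1}+1} + \cdots + a_{2^m}) \\
&\ge \frac{a_1}{2} + a_2 + 2a_4 + \cdots + 2^{m-1} a_{2^m} = \frac{a_1 + t_m}{2}.
\end{aligned}
\tag{5.28}
$$

If $\sum_{n=1}^{\infty} 2^n a_{2^n}$ diverges, then $t_m \to \infty$ as $m \to \infty$. Hence, from (5.28), $s_n \to \infty$ as
$n \to \infty$, and the series $\sum_{n=1}^{\infty} a_n$ is also divergent.

Now, if $\sum_{n=1}^{\infty} 2^n a_{2^n}$ converges, then the sequence $\{t_m\}_{m=1}^{\infty}$ is bounded. From
(5.27), the sequence $\{s_n\}_{n=1}^{\infty}$ is also bounded. It follows that $\sum_{n=1}^{\infty} a_n$ is a
convergent series (see Exercise 5.13). □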


EXAMPLE 5.2.7. Consider again the series $\sum_{n=1}^{\infty} (1/n^k)$. We have already
seen that this series converges if $k > 1$ and diverges if $k \le 1$. Let us now
apply Cauchy’s condensation test. In this case,

$$\sum_{n=1}^{\infty} 2^n a_{2^n} = \sum_{n=1}^{\infty} \frac{2^n}{2^{nk}} = \sum_{n=1}^{\infty} \frac{1}{2^{n(k-1)}}$$

is a geometric series $\sum_{n=1}^{\infty} b^n$, where $b = 1/2^{k-1}$. If $k \le 1$, then $b \ge 1$ and the
series diverges. If $k > 1$, then $b < 1$ and the series converges. It is interesting
to note that in this example, both the ratio and the root tests fail.
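The identification of the condensed series with a geometric series in Example 5.2.7 can be checked term by term. The sketch below is an illustration with hypothetical function names; it compares $2^n a_{2^n}$ for $a_n = 1/n^k$ with $b^n$, where $b = 1/2^{k-1}$, and evaluates a few partial sums of the condensed series.

```python
import math


def condensed_term(n, k):
    """2^n * a_(2^n) for a_n = 1/n^k."""
    return 2 ** n / (2 ** n) ** k


def geometric_ratio(k):
    """Common ratio b = 1 / 2^(k-1) of the condensed series."""
    return 1 / 2 ** (k - 1)


def condensed_partial_sum(m, k):
    """t_m = sum of 2^n * a_(2^n) for n = 1, ..., m."""
    return sum(condensed_term(n, k) for n in range(1, m + 1))


if __name__ == "__main__":
    for k in (0.5, 1, 2, 3):
        same = all(math.isclose(condensed_term(n, k), geometric_ratio(k) ** n)
                   for n in range(1, 30))
        print(f"k = {k}: b = {geometric_ratio(k):.4f}, terms match: {same}, "
              f"t_40 = {condensed_partial_sum(40, k):.6g}")
```

For k = 1 every condensed term equals 1, so the partial sums grow linearly; for k = 2 they approach 1, the sum of the geometric series with ratio 1/2.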


The following tests enable us to handle situations where the ratio test fails.
These tests are particular cases of a general test called Kummer’s test.


Kummer’s Test


This test is named after the German mathematician Ernst Eduard Kummer
(1810-1893).


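Theorem 5.2.11. Let $\sum_{n=1}^{\infty} a_n$ and $\sum_{n=1}^{\infty} b_n$ be two series of positive terms.
Suppose that the series $\sum_{n=1}^{\infty} b_n$ is divergent. Let

$$\lim_{n\to\infty} \left( \frac{1}{b_n} \frac{a_n}{a_{n+1}} - \frac{1}{b_{n+1}} \right) = \lambda.$$

Then $\sum_{n=1}^{\infty} a_n$ converges if $\lambda > 0$ and diverges if $\lambda < 0$.


Proof. Suppose that $\lambda > 0$. We can find an integer N such that for $n > N$,

$$\frac{1}{b_n} \frac{a_n}{a_{n+1}} - \frac{1}{b_{n+1}} > \frac{\lambda}{2}. \tag{5.29}$$

Inequality (5.29) can also be written as

$$a_{n+1} < \frac{2}{\lambda} \left( \frac{a_n}{b_n} - \frac{a_{n+1}}{b_{n+1}} \right). \tag{5.30}$$

If $s_n$ is the nth partial sum of $\sum_{n=1}^{\infty} a_n$, then from (5.30) and for $n > N$,

$$s_{n+1} < s_{N+1} + \frac{2}{\lambda} \sum_{i=N+2}^{n+1} \left( \frac{a_{i-1}}{b_{i-1}} - \frac{a_i}{b_i} \right),$$

that is,

$$s_{n+1} < s_{N+1} + \frac{2}{\lambda} \left( \frac{a_{N+1}}{b_{N+1}} - \frac{a_{n+1}}{b_{n+1}} \right), \tag{5.31}$$

$$s_{n+1} < s_{N+1} + \frac{2 a_{N+1}}{\lambda b_{N+1}} \quad \text{for } n > N.$$

Inequality (5.31) indicates that the sequence $\{s_n\}_{n=1}^{\infty}$ is bounded. Hence, the
series $\sum_{n=1}^{\infty} a_n$ is convergent (see Exercise 5.13).
Now, let us suppose that $\lambda < 0$. We can find an integer N such that

$$\frac{1}{b_n} \frac{a_n}{a_{n+1}} - \frac{1}{b_{n+1}} < 0 \quad \text{for } n > N.$$

Thus for such values of n,

$$a_{n+1} > \frac{a_n}{b_n} b_{n+1}. \tag{5.32}$$

It is easy to verify that because of (5.32),

$$a_n > \frac{a_{N+1}}{b_{N+1}} b_n \tag{5.33}$$

for $n \ge N + 2$. Since $\sum_{n=1}^{\infty} b_n$ is divergent, then from (5.33) and the use of the
comparison test we conclude that $\sum_{n=1}^{\infty} a_n$ is divergent too. □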


Two particular cases of Kummer’s test are Raabe’s test and Gauss’s test. 


Raabe’s Test


This test was established in 1832 by J. L. Raabe.


Theorem 5.2.12. Suppose that $\sum_{n=1}^{\infty} a_n$ is a series of positive terms and
that

$$\frac{a_n}{a_{n+1}} = 1 + \frac{\tau}{n} + o\!\left( \frac{1}{n} \right) \quad \text{as } n \to \infty.$$

Then $\sum_{n=1}^{\infty} a_n$ converges if $\tau > 1$ and diverges if $\tau < 1$.


Proof. We have that

$$\frac{a_n}{a_{n+1}} = 1 + \frac{\tau}{n} + o\!\left( \frac{1}{n} \right).$$

This means that

$$n \left( \frac{a_n}{a_{n+1}} - 1 - \frac{\tau}{n} \right) \to 0 \tag{5.34}$$

as $n \to \infty$. Equivalently, (5.34) can be expressed as

$$\lim_{n\to\infty} \left( \frac{n a_n}{a_{n+1}} - n - 1 \right) = \tau - 1. \tag{5.35}$$

Let $b_n = 1/n$ in (5.35). This is the nth term of a divergent series. If we now
apply Kummer’s test, we conclude that the series $\sum_{n=1}^{\infty} a_n$ converges if
$\tau - 1 > 0$ and diverges if $\tau - 1 < 0$. □


Gauss’s Test 


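This test is named after Carl Friedrich Gauss (1777-1855). It provides a
slight improvement over Raabe’s test in that it usually enables us to handle
the case $\tau = 1$. For such a value of $\tau$, Raabe’s test is inconclusive.


Theorem 5.2.13. Let $\sum_{n=1}^{\infty} a_n$ be a series of positive terms. Suppose that

$$\frac{a_n}{a_{n+1}} = 1 + \frac{\theta}{n} + O\!\left( \frac{1}{n^{\delta+1}} \right), \quad \delta > 0.$$

Then $\sum_{n=1}^{\infty} a_n$ converges if $\theta > 1$ and diverges if $\theta \le 1$.


Proof. Since

$$O\!\left( \frac{1}{n^{\delta+1}} \right) = o\!\left( \frac{1}{n} \right),$$

then by Raabe’s test, $\sum_{n=1}^{\infty} a_n$ converges if $\theta > 1$ and diverges if $\theta < 1$. Let us
therefore consider $\theta = 1$. We have

$$\frac{a_n}{a_{n+1}} = 1 + \frac{1}{n} + O\!\left( \frac{1}{n^{\delta+1}} \right).$$

Put $b_n = 1/(n \log n)$, and consider

$$
\begin{aligned}
\lim_{n\to\infty} \left( \frac{1}{b_n} \frac{a_n}{a_{n+1}} - \frac{1}{b_{n+1}} \right)
&= \lim_{n\to\infty} \left\{ n \log n \left[ 1 + \frac{1}{n} + O\!\left( \frac{1}{n^{\delta+1}} \right) \right] - (n+1)\log(n+1) \right\} \\
&= \lim_{n\to\infty} \left[ (n+1) \log \frac{n}{n+1} + (n \log n)\, O\!\left( \frac{1}{n^{\delta+1}} \right) \right] = -1.
\end{aligned}
$$

This is true because

$$\lim_{n\to\infty} \left[ (n+1) \log \frac{n}{n+1} \right] = -1 \quad \text{(by l’Hospital’s rule)}$$

and

$$\lim_{n\to\infty} (n \log n)\, O\!\left( \frac{1}{n^{\delta+1}} \right) = 0 \quad \text{[see Example 4.2.3(2)]}.$$

Since $\sum_{n=2}^{\infty} [1/(n \log n)]$ is a divergent series (this can be shown by using
Cauchy’s condensation test), then by Kummer’s test, the series $\sum_{n=1}^{\infty} a_n$ is
divergent. □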


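EXAMPLE 5.2.8. Gauss established his test in order to determine the
convergence of the so-called hypergeometric series. He managed to do so in
an article published in 1812. This series is of the form $1 + \sum_{n=1}^{\infty} a_n$, where

$$a_n = \frac{\alpha(\alpha+1)(\alpha+2)\cdots(\alpha+n-1)\,\beta(\beta+1)(\beta+2)\cdots(\beta+n-1)}{n!\,\gamma(\gamma+1)(\gamma+2)\cdots(\gamma+n-1)}, \quad n = 1, 2, \ldots,$$

where $\alpha, \beta, \gamma$ are real numbers, and none of them is zero or a negative
integer. We have

$$\frac{a_n}{a_{n+1}} = \frac{(n+1)(n+\gamma)}{(n+\alpha)(n+\beta)} = \frac{n^2 + (\gamma+1)n + \gamma}{n^2 + (\alpha+\beta)n + \alpha\beta} = 1 + \frac{\gamma + 1 - \alpha - \beta}{n} + O\!\left( \frac{1}{n^2} \right).$$

In this case, $\theta = \gamma + 1 - \alpha - \beta$ and $\delta = 1$. By Gauss’s test, this series is
convergent if $\theta > 1$, or $\gamma > \alpha + \beta$, and is divergent if $\theta \le 1$, or $\gamma \le \alpha + \beta$.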
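The expansion of $a_n/a_{n+1}$ in Example 5.2.8 can be checked numerically by estimating $\theta$ as $n(a_n/a_{n+1} - 1)$ for large n. The sketch below is illustrative and rests on stated assumptions: the function names are hypothetical, the parameter values are chosen only for the demonstration, and `gauss_verdict` simply applies the criterion $\gamma > \alpha + \beta$, taking for granted that the terms are eventually of one sign so that Gauss's test is applicable.

```python
def term(n, alpha, beta, gamma):
    """a_n of the hypergeometric series, built up factor by factor."""
    value = 1.0
    for j in range(n):
        value *= (alpha + j) * (beta + j) / ((j + 1) * (gamma + j))
    return value


def term_ratio(n, alpha, beta, gamma):
    """a_n / a_(n+1) = (n+1)(n+gamma) / ((n+alpha)(n+beta))."""
    return (n + 1) * (n + gamma) / ((n + alpha) * (n + beta))


def theta(alpha, beta, gamma):
    """Coefficient of 1/n in the expansion of a_n / a_(n+1)."""
    return gamma + 1 - alpha - beta


def gauss_verdict(alpha, beta, gamma):
    """Conclusion of Gauss's test: convergent if and only if theta > 1."""
    return "convergent" if theta(alpha, beta, gamma) > 1 else "divergent"


if __name__ == "__main__":
    alpha, beta, gamma = 0.5, 1.5, 3.5
    for n in (10, 1000, 10 ** 6):
        print(n, n * (term_ratio(n, alpha, beta, gamma) - 1))
    print(theta(alpha, beta, gamma), gauss_verdict(alpha, beta, gamma))
```

As a consistency check on the borderline case, the choice $\alpha = \beta = 1$, $\gamma = 2$ gives $a_n = 1/(n + 1)$, so the series is essentially the harmonic series, and the criterion correctly reports divergence.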

5.2.2. Series of Positive and Negative Terms


Consider the series $\sum_{n=1}^{\infty} a_n$, where $a_n$ may be positive or negative for $n \ge 1$.
The convergence of this general series can be determined by the Cauchy
criterion (Theorem 5.1.6). However, it is more convenient to consider the
series $\sum_{n=1}^{\infty} |a_n|$ of absolute values, to which the tests of convergence in
Section 5.2.1 can be applied. We recall from Definition 5.2.2 that if the latter
series converges, then the series $\sum_{n=1}^{\infty} a_n$ is absolutely convergent. This is a
stronger type of convergence than the one given in Definition 5.2.1, since by
Theorem 5.2.3 convergence of $\sum_{n=1}^{\infty} |a_n|$ implies convergence of $\sum_{n=1}^{\infty} a_n$. The
converse, however, is not necessarily true, that is, convergence of $\sum_{n=1}^{\infty} a_n$
does not necessarily imply convergence of $\sum_{n=1}^{\infty} |a_n|$. For example, consider
the series

$$\sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{2n-1} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots. \tag{5.36}$$

This series is convergent by the result of Example 5.1.6. It is not, however,
absolutely convergent, since $\sum_{n=1}^{\infty} [1/(2n - 1)]$ is divergent by comparison
with the harmonic series $\sum_{n=1}^{\infty} (1/n)$, which is divergent. We recall that a
series such as (5.36) that converges, but not absolutely, is called a
conditionally convergent series.

The series in (5.36) belongs to a special class of series known as alternating
series.


Definition 5.2.3. The series $\sum_{n=1}^{\infty} (-1)^{n-1} a_n$, where $a_n > 0$ for $n \ge 1$, is
called an alternating series. □


The following theorem, which was established by Gottfried Wilhelm 
Leibniz (1646-1716), can be used to determine convergence of alternating 
series: 


Theorem 5.2.14. Let $\sum_{n=1}^{\infty} (-1)^{n-1} a_n$ be an alternating series such that
the sequence $\{a_n\}_{n=1}^{\infty}$ is monotone decreasing and converges to zero as
$n \to \infty$. Then the series is convergent.


Proof. Let $s_n$ be the nth partial sum of the series, and let m be an integer
such that $m < n$. Then

$$s_n - s_m = \sum_{i=m+1}^{n} (-1)^{i-1} a_i = (-1)^m \left[ a_{m+1} - a_{m+2} + \cdots + (-1)^{n-m-1} a_n \right]. \tag{5.37}$$

Since $\{a_n\}_{n=1}^{\infty}$ is monotone decreasing, it is easy to show that the quantity
inside brackets in (5.37) is nonnegative. Hence,

$$|s_n - s_m| = a_{m+1} - a_{m+2} + \cdots + (-1)^{n-m-1} a_n.$$

Now, if $n - m$ is odd, then

$$|s_n - s_m| = a_{m+1} - (a_{m+2} - a_{m+3}) - \cdots - (a_{n-1} - a_n) \le a_{m+1}.$$

If $n - m$ is even, then

$$|s_n - s_m| = a_{m+1} - (a_{m+2} - a_{m+3}) - \cdots - (a_{n-2} - a_{n-1}) - a_n \le a_{m+1}.$$

Thus in both cases

$$|s_n - s_m| \le a_{m+1}.$$

Since the sequence $\{a_n\}_{n=1}^{\infty}$ converges to zero, then for a given $\epsilon > 0$ there
exists an integer N such that for $m \ge N$, $a_{m+1} < \epsilon$. Consequently,

$$|s_n - s_m| < \epsilon \quad \text{if } n > m \ge N.$$

By Theorem 5.2.1, the alternating series is convergent. □
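The key inequality in the proof, $|s_n - s_m| \le a_{m+1}$, can be checked numerically for a particular alternating series. The sketch below is an illustration under stated assumptions: it uses $a_n = 1/n$ (the alternating harmonic series), the function names are hypothetical, and the comparison with log 2 relies on the well-known sum of that series, which is not derived in this section.

```python
import math


def partial_sum(a, n):
    """s_n = a(1) - a(2) + a(3) - ... + (-1)^(n-1) a(n)."""
    return sum((-1) ** (i - 1) * a(i) for i in range(1, n + 1))


def reciprocal(i):
    """a_i = 1/i, positive, monotone decreasing, and tending to zero."""
    return 1 / i


def bound_holds(a, m, n):
    """Check |s_n - s_m| <= a(m+1) for n > m (tiny slack for rounding)."""
    return abs(partial_sum(a, n) - partial_sum(a, m)) <= a(m + 1) + 1e-12


if __name__ == "__main__":
    print(all(bound_holds(reciprocal, m, n)
              for m in range(1, 20) for n in range(m + 1, m + 30)))
    s = partial_sum(reciprocal, 10_000)
    print(s, math.log(2), abs(s - math.log(2)) <= reciprocal(10_001))
```

Letting n tend to infinity in the inequality shows that the error made by stopping at the mth partial sum is at most the first omitted term, which is what the last printed comparison illustrates.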


EXAMPLE 5.2.9. The series given by formula (5.36) was shown earlier to 
be convergent. This result can now be easily verified with the help of 
Theorem 5.2.14. 


EXAMPLE 5.2.10. The series $\sum_{n=1}^{\infty} (-1)^n/n^k$ is absolutely convergent if
$k > 1$, is conditionally convergent if $0 < k \le 1$, and is divergent if $k \le 0$ (since
the nth term does not go to zero).


EXAMPLE 5.2.11. The series $\sum_{n=2}^{\infty} (-1)^n/(\sqrt{n} \log n)$ is conditionally
convergent, since it converges by Theorem 5.2.14, but the series of absolute
values diverges by Cauchy’s condensation test (Theorem 5.2.10).


5.2.3. Rearrangement of Series 


One of the main differences between infinite series and finite series is that 
whereas the latter are amenable to the laws of algebra, the former are not 
necessarily so. In particular, if the order of terms of an infinite series is 
altered, its sum (assuming it converges) may, in general, change; or worse, the 
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altered series may even diverge. Before discussing this rather disturbing 
phenomenon, let us consider the following definition: 


Definition 5.2.4. Let $J^+$ denote the set of positive integers and $\sum_{n=1}^{\infty} a_n$ be a given series. Then a second series such as $\sum_{n=1}^{\infty} b_n$ is said to be a rearrangement of $\sum_{n=1}^{\infty} a_n$ if there exists a one-to-one and onto function $f: J^+ \to J^+$ such that $b_n = a_{f(n)}$ for $n \ge 1$.

For example, the series

$$1 + \frac{1}{3} - \frac{1}{2} + \frac{1}{5} + \frac{1}{7} - \frac{1}{4} + \cdots, \qquad (5.38)$$

where two positive terms are followed by one negative term, is a rearrangement of the alternating harmonic series

$$1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots. \qquad (5.39)$$

The series in (5.39) is conditionally convergent, as is the series in (5.38). However, the two series have different sums (see Exercise 5.21). □
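
The difference between the two sums can be seen numerically. The sketch below is only an illustration: it relies on the standard facts that the alternating harmonic series sums to $\log 2$ and that the rearrangement with two positive terms followed by one negative term sums to $\tfrac{3}{2}\log 2$; the function names are hypothetical.

```python
import math


def alternating_harmonic(n_terms):
    """Partial sum of 1 - 1/2 + 1/3 - 1/4 + ... (series (5.39))."""
    return sum((-1) ** (k - 1) / k for k in range(1, n_terms + 1))


def rearranged(n_blocks):
    """Partial sum of (5.38), taken block by block.

    Block j consists of the two positive terms 1/(4j-3) and 1/(4j-1)
    followed by the negative term -1/(2j).
    """
    total = 0.0
    for j in range(1, n_blocks + 1):
        total += 1.0 / (4 * j - 3) + 1.0 / (4 * j - 1) - 1.0 / (2 * j)
    return total


# Compare each partial sum with its limiting value.
print(alternating_harmonic(200000), math.log(2))
print(rearranged(200000), 1.5 * math.log(2))
```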


Fortunately, for absolutely convergent series we have the following 
theorem: 


Theorem 5.2.15. If the series $\sum_{n=1}^{\infty} a_n$ is absolutely convergent, then any rearrangement of it remains absolutely convergent and has the same sum.

Proof. Suppose that $\sum_{n=1}^{\infty} a_n$ is absolutely convergent and that $\sum_{n=1}^{\infty} b_n$ is a rearrangement of it. By Theorem 5.2.1, for a given $\varepsilon > 0$, there exists an integer $N$ such that for all $n > m > N$,

$$\sum_{i=m+1}^{n} |a_i| < \frac{\varepsilon}{2}.$$

We then have

$$\sum_{k=1}^{\infty} |a_{m+k}| \le \frac{\varepsilon}{2} \quad \text{if } m > N.$$

Now, let us choose an integer $M$ large enough so that

$$\{1, 2, \ldots, N+1\} \subset \{f(1), f(2), \ldots, f(M)\}.$$

It follows that if $n > M$, then $f(n) \ge N + 2$. Consequently, for $n > m > M$,

$$\sum_{i=m+1}^{n} |b_i| = \sum_{i=m+1}^{n} |a_{f(i)}| \le \sum_{k=1}^{\infty} |a_{N+k+1}| \le \frac{\varepsilon}{2}.$$

This implies that the series $\sum_{n=1}^{\infty} |b_n|$ satisfies the Cauchy criterion of Theorem 5.2.1. Therefore, $\sum_{n=1}^{\infty} b_n$ is absolutely convergent.

We now show that the two series have the same sum. Let $s = \sum_{n=1}^{\infty} a_n$, and $s_n$ be its nth partial sum. Then, for a given $\varepsilon > 0$ there exists an integer $N$ large enough so that

$$|s_{N+1} - s| < \frac{\varepsilon}{2}.$$

If $t_n$ is the nth partial sum of $\sum_{n=1}^{\infty} b_n$, then

$$|t_n - s| \le |t_n - s_{N+1}| + |s_{N+1} - s|.$$

By choosing $M$ large enough as was done earlier, and by taking $n > M$, we get

$$|t_n - s_{N+1}| = \left| \sum_{i=1}^{n} b_i - \sum_{i=1}^{N+1} a_i \right| = \left| \sum_{i=1}^{n} a_{f(i)} - \sum_{i=1}^{N+1} a_i \right| \le \sum_{k=1}^{\infty} |a_{N+k+1}| \le \frac{\varepsilon}{2},$$

since if $n > M$,

$$\{a_1, a_2, \ldots, a_{N+1}\} \subset \{a_{f(1)}, a_{f(2)}, \ldots, a_{f(n)}\}.$$

Hence, for $n > M$,

$$|t_n - s| < \varepsilon,$$

which shows that the sum of the series $\sum_{n=1}^{\infty} b_n$ is $s$. □


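Unlike absolutely convergent series, those that are conditionally convergent are susceptible to rearrangements of their terms. To demonstrate this, let us consider the following alternating series:

$$\sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{\sqrt{n}}.$$

This series is conditionally convergent, since it is convergent by Theorem 5.2.14 while $\sum_{n=1}^{\infty} (1/\sqrt{n})$ is divergent. Let us consider the following rearrangement:

$$1 + \frac{1}{\sqrt{3}} - \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{5}} + \frac{1}{\sqrt{7}} - \frac{1}{\sqrt{4}} + \cdots, \qquad (5.40)$$

in which two positive terms are followed by one that is negative. Let $s_{3n}'$ denote the sum of the first $3n$ terms of (5.40). Then

$$\begin{aligned}
s_{3n}' &= \left(1 + \frac{1}{\sqrt{3}} - \frac{1}{\sqrt{2}}\right) + \left(\frac{1}{\sqrt{5}} + \frac{1}{\sqrt{7}} - \frac{1}{\sqrt{4}}\right) + \cdots + \left(\frac{1}{\sqrt{4n-3}} + \frac{1}{\sqrt{4n-1}} - \frac{1}{\sqrt{2n}}\right) \\
&= \left(1 - \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{3}} - \frac{1}{\sqrt{4}} + \cdots + \frac{1}{\sqrt{2n-1}} - \frac{1}{\sqrt{2n}}\right) + \left(\frac{1}{\sqrt{2n+1}} + \frac{1}{\sqrt{2n+3}} + \cdots + \frac{1}{\sqrt{4n-1}}\right) \\
&= s_{2n} + \frac{1}{\sqrt{2n+1}} + \frac{1}{\sqrt{2n+3}} + \cdots + \frac{1}{\sqrt{4n-1}},
\end{aligned}$$

where $s_{2n}$ is the sum of the first $2n$ terms of the original series. The last sum contains $n$ terms, each of which is at least $1/\sqrt{4n-1}$. We note that

$$s_{3n}' \ge s_{2n} + \frac{n}{\sqrt{4n-1}}. \qquad (5.41)$$

If $s$ is the sum of the original series, then $\lim_{n \to \infty} s_{2n} = s$ in (5.41). But, since

$$\lim_{n \to \infty} \frac{n}{\sqrt{4n-1}} = \infty,$$

the sequence $\{s_{3n}'\}_{n=1}^{\infty}$ is not convergent, which implies that the series in (5.40) is divergent. This clearly shows that a rearrangement of a conditionally convergent series can change its character. This rather unsettling characteristic of conditionally convergent series is depicted in the following theorem due to Georg Riemann (1826-1866):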


Theorem 5.2.16. A conditionally convergent series can always be rearranged so as to converge to any given number $s$, or to diverge to $+\infty$ or to $-\infty$.


Proof. The proof can be found in several books, for example, Apostol
(1964, page 368), Fulks (1978, page 489), Knopp (1951, page 318), and Rudin
(1964, page 67). □
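
The idea behind Theorem 5.2.16 can be illustrated with a greedy rearrangement. The following sketch is not the proof cited above; it assumes the alternating harmonic series as the conditionally convergent series and uses a hypothetical function name. Positive terms are taken while the running sum is at or below the target, and negative terms are taken otherwise, so every term of the original series is eventually used.

```python
def rearrange_to_target(target, n_terms):
    """Greedy rearrangement of 1 - 1/2 + 1/3 - 1/4 + ... toward `target`.

    Positive terms 1, 1/3, 1/5, ... are added while the running sum is at
    or below the target; negative terms -1/2, -1/4, ... are added while it
    is above. Returns the partial sum after n_terms terms.
    """
    pos, neg = 1, 2  # next odd and next even denominator
    total = 0.0
    for _ in range(n_terms):
        if total <= target:
            total += 1.0 / pos
            pos += 2
        else:
            total -= 1.0 / neg
            neg += 2
    return total


# The same terms, taken in different orders, approach different targets.
print(rearrange_to_target(2.0, 100000))
print(rearrange_to_target(-1.0, 100000))
```

Because the terms of the series tend to zero, the overshoot at each crossing of the target shrinks, which is why the partial sums settle near the chosen value.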


5.2.4. Multiplication of Series 


Suppose that $\sum_{n=1}^{\infty} a_n$ and $\sum_{n=1}^{\infty} b_n$ are two series. We recall from Theorem 5.2.2 that if these series are convergent, then their sum is a convergent series obtained by adding the two series term by term. The product of these two series, however, requires a more delicate operation. There are several ways to define this product. We shall consider the so-called Cauchy’s product.

Definition 5.2.5. Let $\sum_{n=0}^{\infty} a_n$ and $\sum_{n=0}^{\infty} b_n$ be two series in which the summation index starts at zero instead of one. Cauchy’s product of these two series is the series $\sum_{n=0}^{\infty} c_n$, where

$$c_n = \sum_{k=0}^{n} a_k b_{n-k}, \quad n = 0, 1, 2, \ldots,$$

that is,

$$\sum_{n=0}^{\infty} c_n = a_0 b_0 + (a_0 b_1 + a_1 b_0) + (a_0 b_2 + a_1 b_1 + a_2 b_0) + \cdots.$$

Other products could have been defined by simply adopting different arrangements of the terms that make up the series $\sum_{n=0}^{\infty} c_n$. □
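
The definition of Cauchy’s product translates directly into a short computation. The sketch below is illustrative only; the function name `cauchy_product` and the choice of the geometric series $a_n = b_n = (1/2)^n$ as a test case are assumptions made for this example.

```python
def cauchy_product(a, b):
    """Terms c_0, ..., c_{N-1} of Cauchy's product of two series.

    `a` and `b` hold the first N terms a_0, ..., a_{N-1} and b_0, ..., b_{N-1};
    the n-th output term is c_n = sum_{k=0}^{n} a_k * b_{n-k}.
    """
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    return [sum(a[k] * b[n - k] for k in range(n + 1)) for n in range(len(a))]


# With a_n = b_n = (1/2)^n, each series sums to 2, so the product sums to 4.
a = [0.5 ** n for n in range(60)]
print(sum(cauchy_product(a, a)))
```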


The question now is: under what condition will Cauchy’s product of two 
series converge? The answer to this question is given in the next theorem. 


Theorem 5.2.17. Let $\sum_{n=0}^{\infty} c_n$ be Cauchy’s product of $\sum_{n=0}^{\infty} a_n$ and $\sum_{n=0}^{\infty} b_n$. Suppose that these two series are convergent and have sums equal to $s$ and $t$, respectively.


1. If at least one of $\sum_{n=0}^{\infty} a_n$ and $\sum_{n=0}^{\infty} b_n$ converges absolutely, then $\sum_{n=0}^{\infty} c_n$ converges and its sum is equal to $st$ (this result is known as Mertens’s theorem).

2. If both series are absolutely convergent, then $\sum_{n=0}^{\infty} c_n$ converges absolutely to the product $st$ (this result is due to Cauchy).


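Proof.

1. Suppose that $\sum_{n=0}^{\infty} a_n$ is the series that converges absolutely. Let $s_n$, $t_n$, and $u_n$ denote the partial sums $\sum_{i=0}^{n} a_i$, $\sum_{i=0}^{n} b_i$, and $\sum_{i=0}^{n} c_i$, respectively. We need to show that $u_n \to st$ as $n \to \infty$. We have that

$$\begin{aligned}
u_n &= a_0 b_0 + (a_0 b_1 + a_1 b_0) + \cdots + (a_0 b_n + a_1 b_{n-1} + \cdots + a_n b_0) \\
&= a_0 t_n + a_1 t_{n-1} + \cdots + a_n t_0. \qquad (5.42)
\end{aligned}$$

Let $\beta_n$ denote the remainder of the series $\sum_{n=0}^{\infty} b_n$ with respect to $t_n$, that is, $\beta_n = t - t_n$ ($n = 0, 1, 2, \ldots$). By making the proper substitution in (5.42) we get

$$\begin{aligned}
u_n &= a_0 (t - \beta_n) + a_1 (t - \beta_{n-1}) + \cdots + a_n (t - \beta_0) \\
&= t s_n - (a_0 \beta_n + a_1 \beta_{n-1} + \cdots + a_n \beta_0). \qquad (5.43)
\end{aligned}$$

Since $s_n \to s$ as $n \to \infty$, the proof of (1) will be complete if we can show that the sum inside parentheses in (5.43) goes to zero as $n \to \infty$. We now proceed to show that this is the case.

Let $\varepsilon > 0$ be given. Since the sequence $\{\beta_n\}_{n=0}^{\infty}$ converges to zero, there exists an integer $N$ such that

$$|\beta_n| < \varepsilon \quad \text{if } n > N.$$

Hence,

$$\begin{aligned}
|a_0 \beta_n + a_1 \beta_{n-1} + \cdots + a_n \beta_0|
&\le |a_n \beta_0 + a_{n-1} \beta_1 + \cdots + a_{n-N} \beta_N| \\
&\quad + |a_{n-N-1} \beta_{N+1} + a_{n-N-2} \beta_{N+2} + \cdots + a_0 \beta_n| \\
&\le |a_n \beta_0 + a_{n-1} \beta_1 + \cdots + a_{n-N} \beta_N| + \varepsilon \sum_{i=0}^{n-N-1} |a_i| \\
&\le B \sum_{i=n-N}^{n} |a_i| + \varepsilon s^*, \qquad (5.44)
\end{aligned}$$

where $B = \max(|\beta_0|, |\beta_1|, \ldots, |\beta_N|)$ and $s^*$ is the sum of the series $\sum_{n=0}^{\infty} |a_n|$ ($\sum_{n=0}^{\infty} a_n$ is absolutely convergent). Furthermore, because of this and by the Cauchy criterion we can find an integer $M$ such that

$$\sum_{i=n-N}^{n} |a_i| < \varepsilon \quad \text{if } n - N > M + 1.$$

Thus when $n > N + M + 1$ we get from inequality (5.44)

$$|a_0 \beta_n + a_1 \beta_{n-1} + \cdots + a_n \beta_0| \le \varepsilon (B + s^*).$$

Since $\varepsilon$ can be arbitrarily small, we conclude that

$$\lim_{n \to \infty} (a_0 \beta_n + a_1 \beta_{n-1} + \cdots + a_n \beta_0) = 0.$$

2. Let $v_n$ denote the nth partial sum of $\sum_{i=0}^{\infty} |c_i|$. Then

$$\begin{aligned}
v_n &= |a_0 b_0| + |a_0 b_1 + a_1 b_0| + \cdots + |a_0 b_n + a_1 b_{n-1} + \cdots + a_n b_0| \\
&\le |a_0||b_0| + |a_0||b_1| + |a_1||b_0| + \cdots + |a_0||b_n| + |a_1||b_{n-1}| + \cdots + |a_n||b_0| \\
&= |a_0| t_n^* + |a_1| t_{n-1}^* + \cdots + |a_n| t_0^*,
\end{aligned}$$

where $t_k^* = \sum_{i=0}^{k} |b_i|$, $k = 0, 1, 2, \ldots, n$. Thus,

$$v_n \le (|a_0| + |a_1| + \cdots + |a_n|)\, t_n^* \le s^* t^* \quad \text{for all } n,$$

where $t^*$ is the sum of the series $\sum_{n=0}^{\infty} |b_n|$, which is convergent by assumption. We conclude that the sequence $\{v_n\}_{n=0}^{\infty}$ is bounded. Since $v_n \ge 0$, then by Exercise 5.12 this sequence is convergent, and therefore $\sum_{n=0}^{\infty} c_n$ converges absolutely. By part (1), the sum of this series is $st$. □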


It should be noted that absolute convergence of at least one of $\sum_{n=0}^{\infty} a_n$ and $\sum_{n=0}^{\infty} b_n$ is an essential condition for the validity of part (1) of Theorem 5.2.17. If this condition is not satisfied, then $\sum_{n=0}^{\infty} c_n$ may not converge. For example, consider the series $\sum_{n=0}^{\infty} a_n$, $\sum_{n=0}^{\infty} b_n$, where

$$a_n = b_n = \frac{(-1)^n}{\sqrt{n+1}}, \quad n = 0, 1, 2, \ldots.$$

These two series are convergent by Theorem 5.2.14. They are not, however, absolutely convergent, and their Cauchy’s product is divergent (see Exercise
5.22).
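
A numerical look at such a product is instructive. The sketch below assumes the classical pair $a_n = b_n = (-1)^n/\sqrt{n+1}$ as the example; the function name is hypothetical. For this pair every product in $c_n$ has the same sign, and since $\sqrt{(k+1)(n-k+1)} \le (n+2)/2$, one gets $|c_n| \ge 2(n+1)/(n+2) \ge 1$, so $c_n$ cannot tend to zero.

```python
import math


def cauchy_term(n):
    """n-th term c_n of Cauchy's product of sum (-1)^k / sqrt(k+1) with itself."""
    return sum(
        ((-1) ** k / math.sqrt(k + 1)) * ((-1) ** (n - k) / math.sqrt(n - k + 1))
        for k in range(n + 1)
    )


# The terms alternate in sign but stay at least 1 in absolute value.
for n in (0, 1, 10, 100):
    print(n, cauchy_term(n))
```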


5.3. SEQUENCES AND SERIES OF FUNCTIONS 
All the sequences and series considered thus far in this chapter had constant terms. We now extend our study to sequences and series whose terms are functions of $x$.


Definition 5.3.1. Let $\{f_n(x)\}_{n=1}^{\infty}$ be a sequence of functions defined on a set $D \subset R$.

1. If there exists a function $f(x)$ defined on $D$ such that for every $x$ in $D$,

$$\lim_{n \to \infty} f_n(x) = f(x),$$

then the sequence $\{f_n(x)\}_{n=1}^{\infty}$ is said to converge to $f(x)$ on $D$. Thus for a given $\varepsilon > 0$ there exists an integer $N$ such that $|f_n(x) - f(x)| < \varepsilon$ if $n > N$. In general, $N$ depends on $\varepsilon$ as well as on $x$.

2. If $\sum_{n=1}^{\infty} f_n(x)$ converges for every $x$ in $D$ to $s(x)$, then $s(x)$ is said to be the sum of the series. In this case, for a given $\varepsilon > 0$ there exists an integer $N$ such that

$$|s_n(x) - s(x)| < \varepsilon \quad \text{if } n > N,$$

where $s_n(x)$ is the nth partial sum of the series $\sum_{n=1}^{\infty} f_n(x)$. The integer $N$ depends on $\varepsilon$ and, in general, on $x$ also.

3. In particular, if $N$ in (1) depends on $\varepsilon$ but not on $x \in D$, then the sequence $\{f_n(x)\}_{n=1}^{\infty}$ is said to converge uniformly to $f(x)$ on $D$. Similarly, if $N$ in (2) depends on $\varepsilon$, but not on $x \in D$, then the series $\sum_{n=1}^{\infty} f_n(x)$ converges uniformly to $s(x)$ on $D$. □


The Cauchy criterion for sequences (Theorem 5.1.6) and its application to series (Theorem 5.2.1) apply to sequences and series of functions. In case of uniform convergence, the integer $N$ described in this criterion depends only on $\varepsilon$.


Theorem 5.3.1. Let $\{f_n(x)\}_{n=1}^{\infty}$ be a sequence of functions defined on $D \subset R$ and converging to $f(x)$. Define the number $\lambda_n$ as

$$\lambda_n = \sup_{x \in D} |f_n(x) - f(x)|.$$

Then the sequence converges uniformly to $f(x)$ on $D$ if and only if $\lambda_n \to 0$ as $n \to \infty$.


Proof. Sufficiency: Suppose that $\lambda_n \to 0$ as $n \to \infty$. We need to show that $f_n(x) \to f(x)$ uniformly on $D$. Let $\varepsilon > 0$ be given. Then there exists an integer $N$ such that for $n > N$, $\lambda_n < \varepsilon$. Hence, for such values of $n$,

$$|f_n(x) - f(x)| \le \lambda_n < \varepsilon$$

for all $x \in D$. Since $N$ depends only on $\varepsilon$, the sequence $\{f_n(x)\}_{n=1}^{\infty}$ converges uniformly to $f(x)$ on $D$.

Necessity: Suppose that $f_n(x) \to f(x)$ uniformly on $D$. We need to show that $\lambda_n \to 0$. Let $\varepsilon > 0$ be given. There exists an integer $N$ that depends only on $\varepsilon$ such that for $n > N$,

$$|f_n(x) - f(x)| < \frac{\varepsilon}{2}$$

for all $x \in D$. It follows that

$$\lambda_n = \sup_{x \in D} |f_n(x) - f(x)| \le \frac{\varepsilon}{2}.$$

Thus $\lambda_n \to 0$ as $n \to \infty$. □


Theorem 5.3.1 can be applied to convergent series of functions by replacing $f_n(x)$ and $f(x)$ with $s_n(x)$ and $s(x)$, respectively, where $s_n(x)$ is the nth partial sum of the series and $s(x)$ is its sum.


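EXAMPLE 5.3.1. Let $f_n(x) = \sin(2\pi x/n)$, $0 \le x \le 1$. Then $f_n(x) \to 0$ as $n \to \infty$. Furthermore,

$$\left| \sin\left(\frac{2\pi x}{n}\right) \right| \le \frac{2\pi x}{n} \le \frac{2\pi}{n}.$$

In this case,

$$\lambda_n = \sup_{0 \le x \le 1} \left| \sin\left(\frac{2\pi x}{n}\right) \right| \le \frac{2\pi}{n}.$$

Thus $\lambda_n \to 0$ as $n \to \infty$, and the sequence $\{f_n(x)\}_{n=1}^{\infty}$ converges uniformly to $f(x) = 0$ on $[0, 1]$.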
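
The quantity $\lambda_n$ of Theorem 5.3.1 can be approximated on a grid, which gives a quick check of the bound $\lambda_n \le 2\pi/n$ for $f_n(x) = \sin(2\pi x/n)$. This is a rough sketch: the grid only approximates the supremum, and the function name and grid size are assumptions.

```python
import math


def sup_distance(n, grid_size=10001):
    """Grid approximation of lambda_n = sup over [0, 1] of |sin(2*pi*x/n) - 0|."""
    return max(
        abs(math.sin(2 * math.pi * (i / (grid_size - 1)) / n))
        for i in range(grid_size)
    )


# lambda_n shrinks with n and stays below the bound 2*pi/n.
for n in (1, 10, 100, 1000):
    print(n, sup_distance(n), 2 * math.pi / n)
```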


The next theorem provides a simple test for uniform convergence of series 
of functions. It is due to Karl Weierstrass (1815-1897). 


Theorem 5.3.2 (Weierstrass’s M-test). Let $\sum_{n=1}^{\infty} f_n(x)$ be a series of functions defined on $D \subset R$. If there exists a sequence $\{M_n\}_{n=1}^{\infty}$ of constants such that

$$|f_n(x)| \le M_n, \quad n = 1, 2, \ldots,$$

for all $x \in D$, and if $\sum_{n=1}^{\infty} M_n$ converges, then $\sum_{n=1}^{\infty} f_n(x)$ converges uniformly on $D$.


Proof. Let $\varepsilon > 0$ be given. By the Cauchy criterion (Theorem 5.2.1), there exists an integer $N$ such that

$$\sum_{i=m+1}^{n} M_i < \varepsilon$$

for all $n > m > N$. Hence, for all such values of $m$, $n$, and for all $x \in D$,

$$\left| \sum_{i=m+1}^{n} f_i(x) \right| \le \sum_{i=m+1}^{n} |f_i(x)| \le \sum_{i=m+1}^{n} M_i < \varepsilon.$$

This implies that $\sum_{n=1}^{\infty} f_n(x)$ converges uniformly on $D$ by the Cauchy criterion. □


We note that Weierstrass’s M-test is easier to apply than Theorem 5.3.1, 
since it does not require specifying the sum s(x) of the series. 


EXAMPLE 5.3.2. Let us investigate convergence of the sequence $\{f_n(x)\}_{n=1}^{\infty}$, where $f_n(x)$ is defined as


This sequence converges to

$$f(x) = \begin{cases} 0, & 0 \le x < 1, \\ 1, & x \ge 1. \end{cases}$$
Now,

$$|f_n(x) - f(x)| = \begin{cases} 1/n, & 0 \le x < 1, \\ \exp(x/n) - 1, & 1 \le x < 2, \\ 1/n, & x \ge 2. \end{cases}$$

However, for $1 \le x < 2$,

$$\exp(x/n) - 1 < \exp(2/n) - 1.$$

Furthermore, by Maclaurin’s series expansion,

$$\exp\left(\frac{2}{n}\right) - 1 = \sum_{k=1}^{\infty} \frac{(2/n)^k}{k!} > \frac{1}{n}.$$

Thus,

$$\sup_{0 \le x < \infty} |f_n(x) - f(x)| = \exp(2/n) - 1,$$

which tends to zero as $n \to \infty$. Therefore, the sequence $\{f_n(x)\}_{n=1}^{\infty}$ converges uniformly to $f(x)$ on $[0, \infty)$.


EXAMPLE 5.3.3. Consider the series L*_, f (x), where 


n 


fil) = aa O0<x<l. 
The function f,(x) is monotone increasing with respect to x. It follows that 
for 0<x <1, 


fal x)|=fr(x) < Ham Bots, 


m+n’ 
But the series © _,[1/(n? + n)] is convergent. Hence, *_, f(x) is uniformly 
convergent on [0,1] by Weierstrass’s M-test. 


5.3.1. Properties of Uniformly Convergent Sequences and Series 


Sequences and series of functions that are uniformly convergent have several 
interesting properties. We shall study some of these properties in this section. 


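Theorem 5.3.3. Let $\{f_n(x)\}_{n=1}^{\infty}$ be uniformly convergent to $f(x)$ on a set $D$. If for each $n$, $f_n(x)$ has a limit $\tau_n$ as $x \to x_0$, where $x_0$ is a limit point of $D$, then the sequence $\{\tau_n\}_{n=1}^{\infty}$ converges to $\tau_0 = \lim_{x \to x_0} f(x)$. This is equivalent to stating that

$$\lim_{n \to \infty} \left[ \lim_{x \to x_0} f_n(x) \right] = \lim_{x \to x_0} \left[ \lim_{n \to \infty} f_n(x) \right].$$


Proof. Let us first show that $\{\tau_n\}_{n=1}^{\infty}$ is a convergent sequence. By the Cauchy criterion (Theorem 5.1.6), there exists an integer $N$ such that for a given $\varepsilon > 0$,

$$|f_m(x) - f_n(x)| < \frac{\varepsilon}{2} \quad \text{for all } m > N,\ n > N. \qquad (5.45)$$

The integer $N$ depends only on $\varepsilon$, and inequality (5.45) is true for all $x \in D$, since the sequence is uniformly convergent. By taking the limit as $x \to x_0$ in (5.45) we get

$$|\tau_m - \tau_n| \le \frac{\varepsilon}{2} \quad \text{if } m > N,\ n > N,$$

which indicates that $\{\tau_n\}_{n=1}^{\infty}$ is a Cauchy sequence and is therefore convergent. Let $\tau_0 = \lim_{n \to \infty} \tau_n$. We now need to show that $f(x)$ has a limit as $x \to x_0$ and that this limit is equal to $\tau_0$. Let $\varepsilon > 0$ be given. There exists an integer $N_1$ such that for $n > N_1$,

$$|f(x) - f_n(x)| < \frac{\varepsilon}{4}$$

for all $x \in D$, by the uniform convergence of the sequence. Furthermore, there exists an integer $N_2$ such that

$$|\tau_n - \tau_0| < \frac{\varepsilon}{4} \quad \text{if } n > N_2.$$

Thus for $n > \max(N_1, N_2)$,

$$|f(x) - f_n(x)| + |\tau_n - \tau_0| < \frac{\varepsilon}{2}$$

for all $x \in D$. Then

$$\begin{aligned}
|f(x) - \tau_0| &\le |f(x) - f_n(x)| + |f_n(x) - \tau_n| + |\tau_n - \tau_0| \\
&< |f_n(x) - \tau_n| + \frac{\varepsilon}{2} \qquad (5.46)
\end{aligned}$$

if $n > \max(N_1, N_2)$ for all $x \in D$. By taking the limit as $x \to x_0$ in (5.46) we get

$$\limsup_{x \to x_0} |f(x) - \tau_0| \le \frac{\varepsilon}{2} \qquad (5.47)$$

by the fact that

$$\lim_{x \to x_0} |f_n(x) - \tau_n| = 0 \quad \text{for } n = 1, 2, \ldots.$$

Since $\varepsilon$ is arbitrarily small, inequality (5.47) implies that

$$\lim_{x \to x_0} f(x) = \tau_0. \quad □$$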


Corollary 5.3.1. Let $\{f_n(x)\}_{n=1}^{\infty}$ be a sequence of continuous functions that converges uniformly to $f(x)$ on a set $D$. Then $f(x)$ is continuous on $D$.


Proof. The proof follows directly from Theorem 5.3.3, since $\tau_n = f_n(x_0)$ for $n \ge 1$ and $\tau_0 = \lim_{n \to \infty} \tau_n = \lim_{n \to \infty} f_n(x_0) = f(x_0)$. □


Corollary 5.3.2. Let $\sum_{n=1}^{\infty} f_n(x)$ be a series of functions that converges uniformly to $s(x)$ on a set $D$. If for each $n$, $f_n(x)$ has a limit $\tau_n$ as $x \to x_0$, then the series $\sum_{n=1}^{\infty} \tau_n$ converges and has a sum equal to $s_0 = \lim_{x \to x_0} s(x)$, that is,

$$\lim_{x \to x_0} \sum_{n=1}^{\infty} f_n(x) = \sum_{n=1}^{\infty} \lim_{x \to x_0} f_n(x).$$


Proof. The proof is left to the reader. □


By combining Corollaries 5.3.1 and 5.3.2 we conclude the following corol- 
lary: 


Corollary 5.3.3. Let $\sum_{n=1}^{\infty} f_n(x)$ be a series of continuous functions that
converges uniformly to $s(x)$ on a set $D$. Then $s(x)$ is continuous on $D$.


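EXAMPLE 5.3.4. Let $f_n(x) = x^2/(1 + x^2)^{n-1}$ be defined on $[1, \infty)$ for $n \ge 1$. Let $s_n(x)$ be the nth partial sum of the series $\sum_{n=1}^{\infty} f_n(x)$. Then,

$$s_n(x) = \sum_{k=1}^{n} \frac{x^2}{(1 + x^2)^{k-1}} = x^2 \left[ \frac{1 - \dfrac{1}{(1 + x^2)^n}}{1 - \dfrac{1}{1 + x^2}} \right],$$

by using the fact that the sum of the finite geometric series $\sum_{k=1}^{n} a^{k-1}$ is

$$\sum_{k=1}^{n} a^{k-1} = \frac{1 - a^n}{1 - a}. \qquad (5.48)$$

Since $1/(1 + x^2) < 1$ for $x \ge 1$, then as $n \to \infty$,

$$s_n(x) \to \frac{x^2}{1 - \dfrac{1}{1 + x^2}} = 1 + x^2.$$

Thus,

$$\sum_{n=1}^{\infty} \frac{x^2}{(1 + x^2)^{n-1}} = 1 + x^2.$$

Now, let $x_0 = 1$; then

$$\sum_{n=1}^{\infty} \lim_{x \to 1} \frac{x^2}{(1 + x^2)^{n-1}} = \sum_{n=1}^{\infty} \frac{1}{2^{n-1}} = \frac{1}{1 - \frac{1}{2}} = 2 = \lim_{x \to 1} (1 + x^2),$$

which results from applying formula (5.48) with $a = \frac{1}{2}$ and then letting $n \to \infty$. This provides a verification of Corollary 5.3.2. Note that the series $\sum_{n=1}^{\infty} f_n(x)$ is uniformly convergent by Weierstrass’s M-test (why?).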


Corollaries 5.3.2 and 5.3.3 clearly show that the properties of the function
$f_n(x)$ carry over to the sum $s(x)$ of the series $\sum_{n=1}^{\infty} f_n(x)$ when the series is
uniformly convergent.

Another property that s(x) shares with the f,(x)’s is given by the following 
theorem: 


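Theorem 5.3.4. Let $\sum_{n=1}^{\infty} f_n(x)$ be a series of functions, where $f_n(x)$ is differentiable on $[a, b]$ for $n \ge 1$. Suppose that $\sum_{n=1}^{\infty} f_n(x)$ converges at least at one point $x_0 \in [a, b]$ and that $\sum_{n=1}^{\infty} f_n'(x)$ converges uniformly on $[a, b]$. Then we have the following:


1. $\sum_{n=1}^{\infty} f_n(x)$ converges uniformly to $s(x)$ on $[a, b]$.
2. $s'(x) = \sum_{n=1}^{\infty} f_n'(x)$, that is, the derivative of $s(x)$ is obtained by a term-by-term differentiation of the series $\sum_{n=1}^{\infty} f_n(x)$.


Proof.

1. Let $x \ne x_0$ be a point in $[a, b]$. By the mean value theorem (Theorem 4.2.2), there exists a point $\xi_n$ between $x$ and $x_0$ such that for $n \ge 1$,

$$f_n(x) - f_n(x_0) = (x - x_0) f_n'(\xi_n). \qquad (5.49)$$

Since $\sum_{n=1}^{\infty} f_n'(x)$ is uniformly convergent on $[a, b]$, then by the Cauchy criterion, there exists an integer $N$ such that

$$\left| \sum_{i=m+1}^{n} f_i'(x) \right| < \frac{\varepsilon}{b - a}$$

for all $n > m > N$ and for any $x \in [a, b]$. Applying (5.49) to the finite sum $\sum_{i=m+1}^{n} f_i(x)$, which is itself differentiable on $[a, b]$, gives a single point $\xi$ between $x$ and $x_0$ for which

$$\left| \sum_{i=m+1}^{n} [f_i(x) - f_i(x_0)] \right| = |x - x_0| \left| \sum_{i=m+1}^{n} f_i'(\xi) \right| < \frac{\varepsilon}{b - a}(b - a) = \varepsilon$$

for all $n > m > N$ and for any $x \in [a, b]$. This shows that

$$\sum_{n=1}^{\infty} [f_n(x) - f_n(x_0)]$$

is uniformly convergent on $[a, b]$. Consequently,

$$\sum_{n=1}^{\infty} f_n(x) = \sum_{n=1}^{\infty} [f_n(x) - f_n(x_0)] + s(x_0)$$

is uniformly convergent to $s(x)$ on $[a, b]$, where $s(x_0)$ is the sum of the series $\sum_{n=1}^{\infty} f_n(x_0)$, which was assumed to be convergent.

2. Let $\phi_n(h)$ denote the ratio

$$\phi_n(h) = \frac{f_n(x + h) - f_n(x)}{h}, \quad n = 1, 2, \ldots,$$

where both $x$ and $x + h$ belong to $[a, b]$. By invoking the mean value theorem again, $\phi_n(h)$ can be written as

$$\phi_n(h) = f_n'(x + \theta_n h), \quad n = 1, 2, \ldots,$$

where $0 < \theta_n < 1$. Furthermore, by the uniform convergence of $\sum_{n=1}^{\infty} f_n'(x)$ we can deduce that $\sum_{n=1}^{\infty} \phi_n(h)$ is also uniformly convergent on $[-r, r]$ for some $r > 0$. But

$$\sum_{n=1}^{\infty} \phi_n(h) = \sum_{n=1}^{\infty} \frac{f_n(x + h) - f_n(x)}{h} = \frac{s(x + h) - s(x)}{h}, \qquad (5.50)$$

where $s(x)$ is the sum of the series $\sum_{n=1}^{\infty} f_n(x)$. Let us now apply Corollary 5.3.2 to $\sum_{n=1}^{\infty} \phi_n(h)$. We get

$$\lim_{h \to 0} \sum_{n=1}^{\infty} \phi_n(h) = \sum_{n=1}^{\infty} \lim_{h \to 0} \phi_n(h). \qquad (5.51)$$

From (5.50) and (5.51) we then have

$$\lim_{h \to 0} \frac{s(x + h) - s(x)}{h} = \sum_{n=1}^{\infty} f_n'(x).$$

Thus,

$$s'(x) = \sum_{n=1}^{\infty} f_n'(x). \quad □$$


5.4. POWER SERIES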


A power series is a special case of the series of functions discussed in Section 5.3. It is of the form $\sum_{n=0}^{\infty} a_n x^n$, where the $a_n$’s are constants. We have already encountered such series in connection with Taylor’s and Maclaurin’s series in Section 4.3.

Obviously, just as with any series of functions, the convergence of a power series depends on the values of $x$. By definition, if there exists a number $\rho > 0$ such that $\sum_{n=0}^{\infty} a_n x^n$ is convergent if $|x| < \rho$ and is divergent if $|x| > \rho$, then $\rho$ is said to be the radius of convergence of the series, and the interval $(-\rho, \rho)$ is called the interval of convergence. The set of all values of $x$ for which the power series converges is called its region of convergence.

The definition of the radius of convergence implies that $\sum_{n=0}^{\infty} a_n x^n$ is absolutely convergent within its interval of convergence. This is shown in the next theorem.


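Theorem 5.4.1. Let $\rho$ be the radius of convergence of $\sum_{n=0}^{\infty} a_n x^n$. Suppose that $\rho > 0$. Then $\sum_{n=0}^{\infty} a_n x^n$ converges absolutely for all $x$ inside the interval $(-\rho, \rho)$.


Proof. Let $x$ be such that $|x| < \rho$. There exists a point $x_0 \in (-\rho, \rho)$ such that $|x| < |x_0|$. Then, $\sum_{n=0}^{\infty} a_n x_0^n$ is a convergent series. By Result 5.2.1, $a_n x_0^n \to 0$ as $n \to \infty$, and hence $\{a_n x_0^n\}_{n=0}^{\infty}$ is a bounded sequence by Theorem 5.1.1. Thus

$$|a_n x_0^n| < K \quad \text{for all } n.$$

Now,

$$|a_n x^n| = \left| a_n \left( \frac{x}{x_0} \right)^n x_0^n \right| < K \eta^n,$$

where

$$\eta = \left| \frac{x}{x_0} \right| < 1.$$

Since the geometric series $\sum_{n=0}^{\infty} \eta^n$ is convergent, then by the comparison test (see Theorem 5.2.4), the series $\sum_{n=0}^{\infty} |a_n x^n|$ is convergent. □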


To determine the radius of convergence we shall rely on some of the tests 
of convergence given in Section 5.2.1. 


Theorem 5.4.2. Let $\sum_{n=0}^{\infty} a_n x^n$ be a power series. Suppose that

$$\lim_{n \to \infty} \left| \frac{a_{n+1}}{a_n} \right| = p.$$

Then the radius of convergence of the power series is

$$\rho = \begin{cases} 1/p, & 0 < p < \infty, \\ 0, & p = \infty, \\ \infty, & p = 0. \end{cases}$$


Proof. The proof follows from applying the ratio test given in Theorem 5.2.6 to the series $\sum_{n=0}^{\infty} |a_n x^n|$: We have that if

$$\lim_{n \to \infty} \left| \frac{a_{n+1} x^{n+1}}{a_n x^n} \right| < 1,$$

then $\sum_{n=0}^{\infty} a_n x^n$ is absolutely convergent. This inequality can be written as

$$p |x| < 1. \qquad (5.52)$$

If $0 < p < \infty$, then absolute convergence occurs if $|x| < 1/p$ and the series diverges when $|x| > 1/p$. Thus $\rho = 1/p$. If $p = \infty$, the series diverges whenever $x \ne 0$. In this case, $\rho = 0$. If $p = 0$, then (5.52) holds for any value of $x$, that is, $\rho = \infty$. □


Theorem 5.4.3. Let $\sum_{n=0}^{\infty} a_n x^n$ be a power series. Suppose that

$$\limsup_{n \to \infty} |a_n|^{1/n} = q.$$

Then,

$$\rho = \begin{cases} 1/q, & 0 < q < \infty, \\ 0, & q = \infty, \\ \infty, & q = 0. \end{cases}$$


Proof. This result follows from applying the root test in Theorem 5.2.8 to the series $\sum_{n=0}^{\infty} |a_n x^n|$. Details of the proof are similar to those given in Theorem 5.4.2. □


The determination of the region of convergence of $\sum_{n=0}^{\infty} a_n x^n$ depends on the value of $\rho$. We know that the series converges if $|x| < \rho$ and diverges if $|x| > \rho$. The convergence of the series at $x = \rho$ and $x = -\rho$ has to be determined separately. Thus the region of convergence can be $(-\rho, \rho)$, $[-\rho, \rho)$, $(-\rho, \rho]$, or $[-\rho, \rho]$.


EXAMPLE 5.4.1. Consider the geometric series $\sum_{n=0}^{\infty} x^n$. By applying either Theorem 5.4.2 or Theorem 5.4.3, it is easy to show that $\rho = 1$. The series diverges if $x = 1$ or $-1$. Thus the region of convergence is $(-1, 1)$. The sum of this series can be obtained from formula (5.48) by letting $n$ go to infinity. Thus

$$\sum_{n=0}^{\infty} x^n = \frac{1}{1 - x}, \quad -1 < x < 1. \qquad (5.53)$$


EXAMPLE 5.4.2. Consider the series $\sum_{n=0}^{\infty} (x^n/n!)$. Here,

$$\lim_{n \to \infty} \left| \frac{a_{n+1}}{a_n} \right| = \lim_{n \to \infty} \frac{n!}{(n+1)!} = \lim_{n \to \infty} \frac{1}{n+1} = 0.$$

Thus $\rho = \infty$, and the series converges absolutely for any value of $x$. This particular series is Maclaurin’s expansion of $e^x$, that is,

$$e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}.$$


EXAMPLE 5.4.3. Suppose we have the series $\sum_{n=1}^{\infty} (x^n/n)$. Then

$$\lim_{n \to \infty} \left| \frac{a_{n+1}}{a_n} \right| = \lim_{n \to \infty} \frac{n}{n+1} = 1,$$

and $\rho = 1$. When $x = 1$ we get the harmonic series, which is divergent. When $x = -1$ we get the alternating harmonic series, which is convergent by Theorem 5.2.14. Thus the region of convergence is $[-1, 1)$.
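
The ratio in Theorem 5.4.2 is easy to examine numerically for the three examples above. The sketch below is illustrative only: it evaluates $|a_n/a_{n+1}|$, the reciprocal of the ratio in the theorem, at a single finite $n$, which suggests the value of $\rho$ but proves nothing. All function names are hypothetical.

```python
import math


def ratio_estimate(coeff, n):
    """Finite-n estimate |a_n / a_{n+1}| of the radius of convergence."""
    return abs(coeff(n) / coeff(n + 1))


def geometric(n):
    """a_n = 1 (Example 5.4.1)."""
    return 1.0


def exponential(n):
    """a_n = 1/n! (Example 5.4.2)."""
    return 1.0 / math.factorial(n)


def logarithmic(n):
    """a_n = 1/n for n >= 1 (Example 5.4.3)."""
    return 1.0 / n


print(ratio_estimate(geometric, 50))     # stays at 1
print(ratio_estimate(exponential, 20))   # equals n + 1, so it grows without bound
print(ratio_estimate(logarithmic, 1000)) # (n + 1)/n, close to 1
```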


In addition to being absolutely convergent within its interval of conver- 
gence, a power series is also uniformly convergent there. This is shown in the 
next theorem. 


Theorem 5.4.4. Let $\sum_{n=0}^{\infty} a_n x^n$ be a power series with a radius of convergence $\rho$ ($> 0$). Then we have the following:


1. The series converges uniformly on the interval $[-r, r]$, where $r < \rho$.
2. If $s(x) = \sum_{n=0}^{\infty} a_n x^n$, then $s(x)$ (i) is continuous on $[-r, r]$; (ii) is differentiable on $[-r, r]$ and has derivative

$$s'(x) = \sum_{n=1}^{\infty} n a_n x^{n-1};$$

and (iii) has derivatives of all orders on $[-r, r]$ and

$$\frac{d^k s(x)}{dx^k} = \sum_{n=k}^{\infty} \frac{n!}{(n-k)!} a_n x^{n-k}, \quad k = 1, 2, \ldots.$$


Proof.

1. If $|x| \le r$, then $|a_n x^n| \le |a_n| r^n$ for $n \ge 0$. Since $\sum_{n=0}^{\infty} |a_n| r^n$ is convergent by Theorem 5.4.1, then by the Weierstrass M-test (Theorem 5.3.2), $\sum_{n=0}^{\infty} a_n x^n$ is uniformly convergent on $[-r, r]$.

2. (i) Continuity of $s(x)$ follows directly from Corollary 5.3.3. (ii) To show this result, we first note that the two series $\sum_{n=0}^{\infty} a_n x^n$ and $\sum_{n=1}^{\infty} n a_n x^{n-1}$ have the same radius of convergence. This is true by Theorem 5.4.3 and the fact that

$$\limsup_{n \to \infty} |n a_n|^{1/n} = \limsup_{n \to \infty} |a_n|^{1/n},$$

since $n^{1/n} \to 1$ as $n \to \infty$. We can then assert that $\sum_{n=1}^{\infty} n a_n x^{n-1}$ is uniformly convergent on $[-r, r]$. By Theorem 5.3.4, $s(x)$ is differentiable on $[-r, r]$, and its derivative is obtained by a term-by-term differentiation of $\sum_{n=0}^{\infty} a_n x^n$. (iii) This follows from part (ii) by repeated differentiation of $s(x)$. □


Under a certain condition, the interval on which the power series con- 
verges uniformly can include the end points of the interval of convergence. 
This is discussed in the next theorem. 


Theorem 5.4.5. Let $\sum_{n=0}^{\infty} a_n x^n$ be a power series with a finite nonzero radius of convergence $\rho$. If $\sum_{n=0}^{\infty} a_n \rho^n$ is absolutely convergent, then the power series is uniformly convergent on $[-\rho, \rho]$.


Proof. The proof is similar to that of part 1 of Theorem 5.4.4. In this case, for $|x| \le \rho$, $|a_n x^n| \le |a_n| \rho^n$. Since $\sum_{n=0}^{\infty} |a_n| \rho^n$ is convergent, then $\sum_{n=0}^{\infty} a_n x^n$ is uniformly convergent on $[-\rho, \rho]$ by the Weierstrass M-test. □


EXAMPLE 5.4.4. Consider the geometric series of Example 5.4.1. This series is uniformly convergent on $[-r, r]$, where $r < 1$. Furthermore, by differentiating the two sides of (5.53) we get

$$\sum_{n=1}^{\infty} n x^{n-1} = \frac{1}{(1 - x)^2}, \quad -1 < x < 1.$$

This provides a series expansion of $1/(1 - x)^2$ within the interval $(-1, 1)$. By repeated differentiation it is easy to show that for $-1 < x < 1$,

$$\frac{1}{(1 - x)^k} = \sum_{n=0}^{\infty} \binom{n + k - 1}{n} x^n, \quad k = 1, 2, \ldots.$$

The radius of convergence of this series is $\rho = 1$, the same as for the original series.


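EXAMPLE 5.4.5. Suppose we have the series

$$\sum_{n=1}^{\infty} \frac{1}{2n^2 + n} \left( \frac{2x}{1 - x} \right)^n,$$

which can be written as

$$\sum_{n=1}^{\infty} \frac{z^n}{2n^2 + n},$$

where $z = 2x/(1 - x)$. This is a power series in $z$. By Theorem 5.4.2, the radius of convergence of this series is $\rho = 1$. We note that when $z = 1$ the series $\sum_{n=1}^{\infty} [1/(2n^2 + n)]$ is absolutely convergent. Thus by Theorem 5.4.5, the given series is uniformly convergent for $|z| \le 1$, that is, for values of $x$ satisfying

$$\left| \frac{2x}{1 - x} \right| \le 1,$$

or equivalently,

$$-1 \le x \le \frac{1}{3}.$$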

5.5. SEQUENCES AND SERIES OF MATRICES 


In Section 5.3 we considered sequences and series whose terms were scalar 
functions of x rather than being constant as was done in Sections 5.1 and 5.2. 
In this section we consider yet another extension, in which the terms of the 
series are matrices rather than scalars. We shall provide a brief discussion of 
this extension. The interested reader can find a more detailed study of this 
topic in Gantmacher (1959), Lancaster (1969), and Graybill (1983). As in 
Chapter 2, all matrix elements considered here are real. 

For the purpose of our study of sequences and series of matrices we first need to introduce the norm of a matrix.

Definition 5.5.1. Let $A$ be a matrix of order $m \times n$. A norm of $A$, denoted by $\|A\|$, is a real-valued function of $A$ with the following properties:


1. $\|A\| \ge 0$, and $\|A\| = 0$ if and only if $A = 0$.

2. $\|cA\| = |c| \|A\|$, where $c$ is a scalar.

3. $\|A + B\| \le \|A\| + \|B\|$, where $B$ is any matrix of order $m \times n$.

4. $\|AC\| \le \|A\| \|C\|$, where $C$ is any matrix for which the product $AC$ is defined. □


If $A = (a_{ij})$, then examples of matrix norms that satisfy properties 1, 2, 3, and 4 include the following:


1. The Euclidean norm, $\|A\|_2 = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \right)^{1/2}$.
2. The spectral norm, $\|A\|_s = [e_{\max}(A'A)]^{1/2}$, where $e_{\max}(A'A)$ is the largest eigenvalue of $A'A$.


Definition 5.5.2. Let $A_k = (a_{ijk})$ be matrices of order $m \times n$ for $k \ge 1$. The sequence $\{A_k\}_{k=1}^{\infty}$ is said to converge to the $m \times n$ matrix $A = (a_{ij})$ if $\lim_{k \to \infty} a_{ijk} = a_{ij}$ for $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$. □


For example, the sequence of matrices 


1 1 k-1 
k k k+1 
A= k2 ; k =1,2,,..., 
2 ——— 
2—k? 
converges to 
_{0 -1 1 
a=[ 0 || 
as k > œ. The sequence 
: k?-2 
Z = 
A,= k 9 k=1,2, >, 
1+k 


does not converge, since $k^2 - 2$ goes to infinity as $k \to \infty$. □


From Definition 5.5.2 it is easy to see that $\{A_k\}_{k=1}^{\infty}$ converges to $A$ if and only if

$$\lim_{k \to \infty} \|A_k - A\| = 0,$$

where $\|\cdot\|$ is any matrix norm.


Definition 5.5.3. Let $\{A_k\}_{k=1}^{\infty}$ be a sequence of matrices of order $m \times n$. Then $\sum_{k=1}^{\infty} A_k$ is called an infinite series (or just a series) of matrices. This series is said to converge to the $m \times n$ matrix $S = (s_{ij})$ if and only if the series $\sum_{k=1}^{\infty} a_{ijk}$ converges for all $i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$, where $a_{ijk}$ is the $(i, j)$th element of $A_k$, and

$$\sum_{k=1}^{\infty} a_{ijk} = s_{ij}, \quad i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n. \qquad (5.54)$$

The series $\sum_{k=1}^{\infty} A_k$ is divergent if at least one of the series in (5.54) is divergent. □


From Definition 5.5.3 and Result 5.2.1 we conclude that $\sum_{k=1}^{\infty} A_k$ diverges if $\lim_{k \to \infty} a_{ijk} \ne 0$ for at least one pair $(i, j)$, that is, if $\lim_{k \to \infty} A_k \ne 0$.

A particular type of infinite series of matrices is the power series $\sum_{k=0}^{\infty} \alpha_k A^k$, where $A$ is a square matrix, $\alpha_k$ is a scalar ($k = 0, 1, \ldots$), and $A^0$ is by definition the identity matrix $I$. For example, the power series

$$I + A + \frac{1}{2!} A^2 + \frac{1}{3!} A^3 + \cdots + \frac{1}{k!} A^k + \cdots$$

represents an expansion of the exponential matrix function $\exp(A)$ (see Gantmacher, 1959).


Theorem 5.5.1. Let $A$ be an $n \times n$ matrix. Then $\lim_{k \to \infty} A^k = 0$ if $\|A\| < 1$, where $\|\cdot\|$ is any matrix norm.


Proof. From property 4 in Definition 5.5.1 we can write

$$\|A^k\| \le \|A\|^k, \quad k = 1, 2, \ldots.$$

Since $\|A\| < 1$, then $\lim_{k \to \infty} \|A^k\| = 0$, which implies that $\lim_{k \to \infty} A^k = 0$ (why?). □


Theorem 5.5.2. Let $A$ be a symmetric matrix of order $n \times n$ such that $|\lambda_i| < 1$ for $i = 1, 2, \ldots, n$, where $\lambda_i$ is the ith eigenvalue of $A$ (all the eigenvalues of $A$ are real by Theorem 2.3.5). Then $\sum_{k=0}^{\infty} A^k$ converges to $(I - A)^{-1}$.

Proof. By the spectral decomposition theorem (Theorem 2.3.10) there exists an orthogonal matrix $P$ such that $A = P \Lambda P'$, where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues of $A$. Then

$$A^k = P \Lambda^k P', \quad k = 0, 1, 2, \ldots.$$

Since $|\lambda_i| < 1$ for all $i$, then $\Lambda^k \to 0$ and hence $A^k \to 0$ as $k \to \infty$. Furthermore, the matrix $I - A$ is nonsingular, since

$$I - A = P (I - \Lambda) P'$$

and all the diagonal elements of $I - \Lambda$ are positive.

Now, for any nonnegative integer $k$ we have the following identity:

$$(I - A)(I + A + A^2 + \cdots + A^k) = I - A^{k+1}.$$

Hence,

$$I + A + \cdots + A^k = (I - A)^{-1} (I - A^{k+1}).$$

By letting $k$ go to infinity we get

$$\sum_{k=0}^{\infty} A^k = (I - A)^{-1},$$

since $\lim_{k \to \infty} A^{k+1} = 0$. □
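
The convergence of $\sum_{k=0}^{\infty} A^k$ to $(I - A)^{-1}$ can be observed on a small case. The following sketch uses plain Python lists rather than a matrix library; the $2 \times 2$ symmetric matrix is an arbitrary illustrative choice whose eigenvalues (about 0.583 and 0.017) are both less than one in absolute value, and the helper names are hypothetical.

```python
def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def neumann_sum(A, terms):
    """Partial sum I + A + A^2 + ... + A^(terms - 1)."""
    n = len(A)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in identity]
    power = [row[:] for row in identity]
    for _ in range(terms - 1):
        power = matmul(power, A)  # next power of A
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return total


A = [[0.5, 0.2], [0.2, 0.1]]             # symmetric, eigenvalues inside (-1, 1)
S = neumann_sum(A, 200)
I_minus_A = [[0.5, -0.2], [-0.2, 0.9]]   # I - A for the matrix above
print(matmul(I_minus_A, S))              # should be close to the identity matrix
```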


Theorem 5.5.3. Let $A$ be a symmetric $n \times n$ matrix and $\lambda$ be any eigenvalue of $A$. Then $|\lambda| \le \|A\|$, where $\|A\|$ is any matrix norm of $A$.


Proof. We have that $Av = \lambda v$, where $v$ is an eigenvector of $A$ for the eigenvalue $\lambda$. If $\|A\|$ is any matrix norm of $A$, then

$$\|\lambda v\| = |\lambda| \|v\| = \|Av\| \le \|A\| \|v\|.$$

Since $v \ne 0$, we conclude that

$$|\lambda| \le \|A\|. \quad □$$


Corollary 5.5.1. Let $A$ be a symmetric matrix of order $n \times n$ such that $\|A\| < 1$, where $\|A\|$ is any matrix norm of $A$. Then $\sum_{k=0}^{\infty} A^k$ converges to $(I - A)^{-1}$.


Proof. This result follows from Theorem 5.5.2, since for $i = 1, 2, \ldots, n$, $|\lambda_i| \le \|A\| < 1$. □


5.6. APPLICATIONS IN STATISTICS


Sequences and series have many useful applications in statistics. Some of 
these applications will be discussed in this section. 


5.6.1. Moments of a Discrete Distribution 


Perhaps one of the most visible applications of infinite series in statistics is in 
the study of the distribution of a discrete random variable that can assume a 
countable number of values. Under certain conditions, this distribution can 
be completely determined by its moments. By definition, the moments of a distribution are a set of descriptive constants that are useful for measuring its properties.

Let $X$ be a discrete random variable that takes on the values $x_0, x_1, \ldots, x_n, \ldots$, with probabilities $p(n)$, $n \ge 0$. Then, by definition, the kth central moment of $X$, denoted by $\mu_k$, is

$$\mu_k = E[(X - \mu)^k] = \sum_{n=0}^{\infty} (x_n - \mu)^k p(n), \quad k = 1, 2, \ldots,$$

where $\mu = E(X) = \sum_{n=0}^{\infty} x_n p(n)$ is the mean of $X$. We note that $\mu_2 = \sigma^2$ is the variance of $X$. The kth noncentral moment of $X$ is given by the series

$$\mu_k' = E(X^k) = \sum_{n=0}^{\infty} x_n^k p(n), \quad k = 1, 2, \ldots. \qquad (5.55)$$

We note that $\mu_1' = \mu$. If, for some integer $N$, $|x_n| \ge 1$ for $n > N$, and if the series in (5.55) converges absolutely, then so does the series for $\mu_j'$ ($j = 1, 2, \ldots, k - 1$). This follows from applying the comparison test:

$$|x_n|^j p(n) \le |x_n|^k p(n) \quad \text{if } j < k \text{ and } n > N.$$

Examples of discrete random variables with a countable number of values 
include the Poisson (see Section 4.5.3) and the negative binomial. The latter 
random variable represents the number n of failures before the rth success 
when independent trials are performed, each of which has two probability 
outcomes, success or failure, with a constant probability p of success on each 
trial. Its probability mass function is therefore of the form 


p(n) = ("trl ra- n=0,1,2,.... 
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By contrast, the Poisson random variable has the probability mass function 


ey" 


n! 


p(n) = Í n=0,1,2,..., 


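where $\lambda$ is the mean of $X$. We can verify that $\lambda$ is the mean by writing

\[
\begin{aligned}
\mu &= \sum_{n=0}^{\infty} n\,\frac{e^{-\lambda}\lambda^n}{n!} \\
    &= \lambda e^{-\lambda} \sum_{n=1}^{\infty} \frac{\lambda^{n-1}}{(n-1)!} \\
    &= \lambda e^{-\lambda} \sum_{n=0}^{\infty} \frac{\lambda^n}{n!} \\
    &= \lambda e^{-\lambda}(e^{\lambda}), \qquad \text{by Maclaurin's expansion of } e^{\lambda} \\
    &= \lambda.
\end{aligned}
\]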


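The second noncentral moment of the Poisson distribution is

\[
\begin{aligned}
\mu'_2 &= \sum_{n=0}^{\infty} n^2\,\frac{e^{-\lambda}\lambda^n}{n!} \\
       &= e^{-\lambda}\lambda \sum_{n=1}^{\infty} \frac{n\,\lambda^{n-1}}{(n-1)!} \\
       &= e^{-\lambda}\lambda \sum_{n=1}^{\infty} \frac{(n-1+1)\,\lambda^{n-1}}{(n-1)!} \\
       &= e^{-\lambda}\lambda \left[ \lambda \sum_{n=2}^{\infty} \frac{\lambda^{n-2}}{(n-2)!}
          + \sum_{n=1}^{\infty} \frac{\lambda^{n-1}}{(n-1)!} \right] \\
       &= e^{-\lambda}\lambda\,[\lambda e^{\lambda} + e^{\lambda}] \\
       &= \lambda^2 + \lambda.
\end{aligned}
\]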


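In general, the $k$th noncentral moment of the Poisson distribution is given by
the series

\[
\mu'_k = \sum_{n=0}^{\infty} n^k\,\frac{e^{-\lambda}\lambda^n}{n!}, \qquad k = 1, 2, \ldots,
\]

which converges for any $k$. This can be shown, for example, by the ratio test.
With $a_n$ denoting the $n$th term of the series,

\[
\lim_{n\to\infty} \frac{a_{n+1}}{a_n}
  = \lim_{n\to\infty} \left(\frac{n+1}{n}\right)^{k} \frac{\lambda}{n+1}
  = 0 < 1.
\]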


Thus all the noncentral moments of the Poisson distribution exist. 
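
The two Poisson moments derived above can be checked numerically. The following is a minimal Python sketch, not part of the original text: the function names and the cutoff `n_terms` are illustrative assumptions, and the infinite series is simply truncated where the remaining terms are negligible.

```python
import math

def poisson_pmf(n, lam):
    """Poisson pmf p(n) = exp(-lam) * lam**n / n!, evaluated on the log scale."""
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def poisson_noncentral_moment(k, lam, n_terms=200):
    """Truncated version of the series sum_{n>=0} n**k * p(n), for k >= 1.

    n_terms is an illustrative cutoff (an assumption, not from the text);
    it is adequate when lam is small relative to n_terms.
    """
    return sum(n**k * poisson_pmf(n, lam) for n in range(n_terms))

lam = 3.0
mean = poisson_noncentral_moment(1, lam)     # compare with lam
second = poisson_noncentral_moment(2, lam)   # compare with lam**2 + lam
```

Because the ratio of successive terms tends to zero, the truncated sums settle quickly; for `lam = 3.0` they agree with $\lambda$ and $\lambda^2 + \lambda$ to within floating-point rounding.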
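Similarly, for the negative binomial distribution we have

\[
\mu = \sum_{n=0}^{\infty} n \binom{n+r-1}{n} p^r (1-p)^n
    = \frac{r(1-p)}{p} \qquad \text{(why?)}, \tag{5.56}
\]

\[
\mu'_2 = \sum_{n=0}^{\infty} n^2 \binom{n+r-1}{n} p^r (1-p)^n
       = \frac{r(1-p)(1+r-rp)}{p^2} \qquad \text{(why?)}, \tag{5.57}
\]

and the $k$th noncentral moment,

\[
\mu'_k = \sum_{n=0}^{\infty} n^k \binom{n+r-1}{n} p^r (1-p)^n, \qquad k = 1, 2, \ldots, \tag{5.58}
\]

exists for any $k$, since, by the ratio test,

\[
\begin{aligned}
\lim_{n\to\infty} \frac{a_{n+1}}{a_n}
  &= \lim_{n\to\infty} \left(\frac{n+1}{n}\right)^{k}
     \frac{\binom{n+r}{n+1}}{\binom{n+r-1}{n}}\,(1-p) \\
  &= (1-p) \lim_{n\to\infty} \left(\frac{n+1}{n}\right)^{k} \frac{n+r}{n+1} \\
  &= 1-p < 1,
\end{aligned}
\]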


which proves convergence of the series in (5.58). 
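
Formulas (5.56) and (5.57) can be sanity-checked by summing the series directly. This is a small Python sketch under stated assumptions: `r` is taken to be a positive integer, the values `r = 3`, `p = 0.4` are arbitrary, and the cutoff `n_terms` is an illustrative choice. It is a numerical check, not a proof of the two formulas.

```python
import math

def negbin_pmf(n, r, p):
    """P(n failures before the r-th success), with r a positive integer."""
    return math.comb(n + r - 1, n) * p**r * (1 - p)**n

def negbin_noncentral_moment(k, r, p, n_terms=2000):
    """Truncated series (5.58); n_terms is an illustrative cutoff."""
    return sum(n**k * negbin_pmf(n, r, p) for n in range(n_terms))

r, p = 3, 0.4
mean_series = negbin_noncentral_moment(1, r, p)
mean_formula = r * (1 - p) / p                          # right side of (5.56)
second_series = negbin_noncentral_moment(2, r, p)
second_formula = r * (1 - p) * (1 + r - r * p) / p**2   # right side of (5.57)
```

The terms decay geometrically at rate $1 - p$, as the ratio test above indicates, so a few hundred terms already give agreement to machine precision for these parameter values.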
A very important inequality that concerns the mean $\mu$ and variance $\sigma^2$ of
any random variable $X$ (not just the discrete ones) is Chebyshev's inequality,
namely,

\[
P(|X - \mu| \geq r\sigma) \leq \frac{1}{r^2},
\]

or equivalently,

\[
P(|X - \mu| < r\sigma) \geq 1 - \frac{1}{r^2}, \tag{5.59}
\]

where $r$ is any positive number (see, for example, Lindgren, 1976, Section
2.3.2). The importance of this inequality stems from the fact that it is
independent of the exact distribution of $X$ and connects the variance of $X$
with the distribution of its values. For example, inequality (5.59) states that at
least $(1 - 1/r^2) \times 100\%$ of the values of $X$ fall within $r\sigma$ of its mean,
where $\sigma = \sqrt{\sigma^2}$ is the standard deviation of $X$.

Chebyshev's inequality is a special case of a more general inequality called
Markov's inequality. If $b$ is a nonzero constant and $h(x)$ is a nonnegative
function, then

\[
P[h(X) \geq b^2] \leq \frac{1}{b^2} E[h(X)],
\]

provided that $E[h(X)]$ exists. Chebyshev's inequality follows from Markov's
inequality by choosing $h(X) = (X - \mu)^2$ and $b = r\sigma$.
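
To illustrate how conservative the Chebyshev bound can be, the sketch below compares the exact tail probability $P(|X - \mu| \geq r\sigma)$ with $1/r^2$ for a Poisson variable. The choice of the Poisson distribution, the value `lam = 4.0`, and the truncation point are assumptions made only for this illustration.

```python
import math

def poisson_pmf(n, lam):
    """Poisson pmf evaluated on the log scale."""
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

def chebyshev_tail(lam, r, n_terms=400):
    """P(|X - mu| >= r*sigma) for X ~ Poisson(lam), by truncated summation."""
    mu, sigma = lam, math.sqrt(lam)
    return sum(poisson_pmf(n, lam) for n in range(n_terms)
               if abs(n - mu) >= r * sigma)

lam = 4.0
# For each r: (exact tail probability, Chebyshev bound 1/r**2)
results = {r: (chebyshev_tail(lam, r), 1.0 / r**2) for r in (1.5, 2.0, 3.0)}
```

In each case the exact tail probability falls below the bound, as the inequality requires; the gap reflects the fact that the bound uses no information about the distribution beyond its mean and variance.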

Another important result that concerns the moments of a distribution is 
given by the following theorem, regarding what is known as the Stieltjes 
moment problem, which also applies to any random variable: 


Theorem 5.6.1. Suppose that the moments $\mu'_k$ ($k = 1, 2, \ldots$) of a random
variable $X$ exist, and the series

\[
\sum_{k=1}^{\infty} \frac{\mu'_k}{k!}\,\tau^k \tag{5.60}
\]

is absolutely convergent for some $\tau > 0$. Then these moments uniquely
determine the cumulative distribution function $F(x)$ of $X$.


Proof. See, for example, Fisz (1963, Theorem 3.2.1). □


In particular, if

\[
|\mu'_k| \leq M^k, \qquad k = 1, 2, \ldots,
\]

for some constant $M$, then the series in (5.60) converges absolutely for any
$\tau > 0$ by the comparison test. This is true because the series
$\sum_{k=1}^{\infty} (M^k/k!)\,\tau^k$ converges (for example, by the ratio test) for
any value of $\tau$.
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It should be noted that absolute convergence of the series in (5.60) is a 
sufficient condition for the unique determination of F(x), but is not a 
necessary condition. This is shown in Rao (1973, page 106). Furthermore, 
if some moments of X fail to exist, then the remaining moments that do exist 
cannot determine F(x) uniquely. The following counterexample is given in 
Fisz (1963, page 74): 

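Let $X$ be a discrete random variable that takes on the values $x_n = 2^n/n^2$,
$n \geq 1$, with probabilities $p(n) = 1/2^n$, $n \geq 1$. Then

\[
\mu = E(X) = \sum_{n=1}^{\infty} \frac{1}{n^2},
\]

which exists, because the series is convergent. However, $\mu'_2$ does not exist,
because

\[
\mu'_2 = E(X^2) = \sum_{n=1}^{\infty} \frac{2^n}{n^4},
\]

and this series is divergent, since $2^n/n^4 \to \infty$ as $n \to \infty$.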

Now, let $Y$ be another discrete random variable that takes on the value
zero with probability $\frac{1}{2}$ and the values $y_n = 2^{n+1}/n^2$, $n \geq 1$, with
probabilities $q(n) = 1/2^{n+1}$, $n \geq 1$. Then,

\[
E(Y) = \sum_{n=1}^{\infty} \frac{1}{n^2} = E(X).
\]

The second noncentral moment of $Y$ does not exist, since

\[
\mu'_2 = E(Y^2) = \sum_{n=1}^{\infty} \frac{2^{n+1}}{n^4},
\]

and this series is divergent.

Since $\mu'_2$ does not exist for either $X$ or $Y$, none of their noncentral
moments of order $k > 2$ exist either, as can be seen from applying the
comparison test. Thus $X$ and $Y$ have the same first noncentral moment, but
do not have noncentral moments of any order greater than 1. These two
random variables obviously have different distributions.
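
The behavior of the two series in this counterexample can be seen numerically. The sketch below is only an illustration with hypothetical function names: it accumulates partial sums of the first-moment series $\sum 1/n^2$ and evaluates individual terms $2^n/n^4$ of the second-moment series for $X$.

```python
import math

def first_moment_partial(N):
    """Partial sum of E(X) = sum_{n>=1} x_n p(n) = sum_{n>=1} 1/n**2,
    where x_n = 2**n / n**2 and p(n) = 2**(-n)."""
    return sum(1.0 / n**2 for n in range(1, N + 1))

def second_moment_term(n):
    """n-th term x_n**2 * p(n) = 2**n / n**4 of the series for E(X**2)."""
    return 2.0**n / n**4

partial = first_moment_partial(100_000)      # approaches pi**2 / 6
terms = [second_moment_term(n) for n in (20, 50, 100)]
```

The partial sums of the first series stabilize near $\pi^2/6$, whereas the terms of the second series grow without bound, so that series cannot converge.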


5.6.2. Moment and Probability Generating Functions 

Let $X$ be a discrete random variable that takes on the values $x_0, x_1, x_2, \ldots$
with probabilities $p(n)$, $n \geq 0$.

The Moment Generating Function of X 


This function is defined as

\[
\phi(t) = E(e^{tX}) = \sum_{n=0}^{\infty} e^{t x_n} p(n), \tag{5.61}
\]

provided that the series converges. In particular, if $x_n = n$ for $n \geq 0$, then

\[
\phi(t) = \sum_{n=0}^{\infty} e^{tn} p(n), \tag{5.62}
\]

which is a power series in $e^t$. If $\rho$ is the radius of convergence for this series,
then by Theorem 5.4.4, $\phi(t)$ is a continuous function of $t$ and has derivatives
of all orders inside its interval of convergence. Since

\[
\left. \frac{d^k \phi(t)}{dt^k} \right|_{t=0} = E(X^k) = \mu'_k, \qquad k = 1, 2, \ldots, \tag{5.63}
\]

$\phi(t)$, when it exists, can be used to obtain all noncentral moments of $X$,
which can completely determine the distribution of $X$ by Theorem 5.6.1.

From (5.63), by using Maclaurin's expansion of $\phi(t)$, we can obtain an
expression for this function as a power series in $t$:

\[
\begin{aligned}
\phi(t) &= \phi(0) + \sum_{n=1}^{\infty} \frac{t^n}{n!}\,\phi^{(n)}(0) \\
        &= 1 + \sum_{n=1}^{\infty} \frac{\mu'_n}{n!}\,t^n.
\end{aligned} \tag{5.64}
\]


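If, for the series in (5.62), $\lim_{n\to\infty} [p(n+1)/p(n)] = p$,
then by Theorem 5.4.2, the radius of convergence $\rho$ is

\[
\rho = \begin{cases}
1/p, & 0 < p < \infty, \\
0, & p = \infty, \\
\infty, & p = 0.
\end{cases}
\]

Alternatively, if $\limsup_{n\to\infty} [p(n)]^{1/n} = q$, then

\[
\rho = \begin{cases}
1/q, & 0 < q < \infty, \\
0, & q = \infty, \\
\infty, & q = 0.
\end{cases}
\]

For example, for the Poisson distribution, where

\[
p(n) = \frac{e^{-\lambda}\lambda^n}{n!}, \qquad n = 0, 1, 2, \ldots,
\]

we have $\lim_{n\to\infty} [p(n+1)/p(n)] = \lim_{n\to\infty} [\lambda/(n+1)] = 0$. Hence,
$\rho = \infty$, that is,

\[
\phi(t) = \sum_{n=0}^{\infty} \frac{e^{-\lambda}\lambda^n}{n!}\,e^{tn}
\]

converges uniformly for any value of $t$ for which $e^t < \infty$, that is,
$-\infty < t < \infty$. As a matter of fact, a closed-form expression for $\phi(t)$ can
be found, since

\[
\begin{aligned}
\phi(t) &= e^{-\lambda} \sum_{n=0}^{\infty} \frac{(\lambda e^t)^n}{n!} \\
        &= e^{-\lambda} \exp(\lambda e^t) \\
        &= \exp(\lambda e^t - \lambda) \qquad \text{for all } t.
\end{aligned} \tag{5.65}
\]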
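The $k$th noncentral moment of $X$ is then given by

\[
\mu'_k = \left. \frac{d^k \phi(t)}{dt^k} \right|_{t=0}
       = \left. \frac{d^k [\exp(\lambda e^t - \lambda)]}{dt^k} \right|_{t=0}.
\]

In particular, the first two noncentral moments are

\[
\mu'_1 = \mu = \lambda, \qquad \mu'_2 = \lambda + \lambda^2.
\]

This confirms our earlier finding concerning these two moments.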
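
Formula (5.63) can also be illustrated numerically by differentiating the closed form (5.65) at $t = 0$. The sketch below uses central finite differences; the step size `h` and the function names are assumptions chosen for the illustration, and the result is only an approximation to the exact derivatives.

```python
import math

def poisson_mgf(t, lam):
    """Closed form (5.65): phi(t) = exp(lam * e**t - lam)."""
    return math.exp(lam * math.exp(t) - lam)

def first_two_moments(lam, h=1e-4):
    """Approximate phi'(0) and phi''(0) by central differences (h is an assumed step)."""
    f0 = poisson_mgf(0.0, lam)
    fp = poisson_mgf(h, lam)
    fm = poisson_mgf(-h, lam)
    m1 = (fp - fm) / (2 * h)            # approximates mu'_1 = lam
    m2 = (fp - 2 * f0 + fm) / h**2      # approximates mu'_2 = lam + lam**2
    return m1, m2

m1, m2 = first_two_moments(2.0)
```

For `lam = 2.0` the two approximations are close to $\lambda = 2$ and $\lambda + \lambda^2 = 6$, up to the discretization and rounding error of the finite differences.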
It should be noted that formula (5.63) is valid provided that there exists a
$\delta > 0$ such that the neighborhood $N_\delta(0)$ is contained inside the interval of
convergence. For example, let $X$ have the probability mass function

\[
p(n) = \frac{6}{\pi^2 n^2}, \qquad n = 1, 2, \ldots.
\]

Then

\[
\lim_{n\to\infty} \frac{p(n+1)}{p(n)} = 1.
\]

Hence, by Theorem 5.4.4, the series

\[
\phi(t) = \sum_{n=1}^{\infty} \frac{6}{\pi^2 n^2}\,e^{tn}
\]

converges uniformly for values of $t$ satisfying $e^t \leq r$, where $r < 1$, or
equivalently, for $t \leq \log r < 0$. If, however, $t > 0$, then the series diverges.
Thus there does not exist a neighborhood $N_\delta(0)$ that is contained inside the
interval of convergence for any $\delta > 0$. Consequently, formula (5.63) does not
hold in this case.

From the moment generating function we can derive a series of constants 
that play a role similar to that of the moments. These constants are called 
cumulants. They have properties that are, in certain circumstances, more 
useful than those of the moments. Cumulants were originally defined and 
studied by Thiele (1903). 

By definition, the cumulants of $X$, denoted by $\kappa_1, \kappa_2, \ldots, \kappa_n, \ldots$, are
constants that satisfy the following identity in $t$:

\[
\exp\left( \kappa_1 t + \frac{\kappa_2 t^2}{2!} + \cdots + \frac{\kappa_n t^n}{n!} + \cdots \right)
  = 1 + \mu'_1 t + \frac{\mu'_2}{2!}\,t^2 + \cdots + \frac{\mu'_n}{n!}\,t^n + \cdots. \tag{5.66}
\]

Using formula (5.64), this identity can be written as

\[
\sum_{n=1}^{\infty} \frac{\kappa_n}{n!}\,t^n = \log \phi(t), \tag{5.67}
\]

provided that $\phi(t)$ exists and is positive. By definition, the natural logarithm
of the moment generating function of $X$ is called the cumulant generating
function.

Formula (5.66) can be used to express the noncentral moments in terms of
the cumulants, and vice versa. Kendall and Stuart (1977, Section 3.14) give a
general relationship that can be used for this purpose. For example,

\[
\begin{aligned}
\kappa_1 &= \mu'_1, \\
\kappa_2 &= \mu'_2 - \mu_1'^{\,2}, \\
\kappa_3 &= \mu'_3 - 3\mu'_1 \mu'_2 + 2\mu_1'^{\,3}.
\end{aligned}
\]


The cumulants have an interesting property in that they are, except for $\kappa_1$,
invariant to any constant shift $c$ in $X$. That is, for $n = 2, 3, \ldots$, $\kappa_n$ is not
changed if $X$ is replaced by $X + c$. This follows from noting that

\[
E[e^{(X+c)t}] = e^{ct}\phi(t),
\]

which is the moment generating function of $X + c$. But

\[
\log[e^{ct}\phi(t)] = ct + \log \phi(t).
\]

By comparison with (5.67) we can then conclude that except for $\kappa_1$, the
cumulants of $X + c$ are the same as those of $X$. This contrasts sharply with
the noncentral moments of $X$, which are not invariant to such a shift.

Another advantage of using cumulants is that they can be employed to 
obtain approximate expressions for the percentile points of the distribution 
of X (see Section 9.5.1). 


EXAMPLE 5.6.1. Let $X$ be a Poisson random variable whose moment
generating function is given by formula (5.65). By applying (5.67) we get

\[
\begin{aligned}
\sum_{n=1}^{\infty} \frac{\kappa_n}{n!}\,t^n &= \log[\exp(\lambda e^t - \lambda)] \\
  &= \lambda e^t - \lambda \\
  &= \lambda \sum_{n=1}^{\infty} \frac{t^n}{n!}.
\end{aligned}
\]

Here, we have made use of Maclaurin's expansion of $e^t$. This series converges
for any value of $t$. It follows that $\kappa_n = \lambda$ for $n = 1, 2, \ldots$.
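
The conclusion of Example 5.6.1 can be checked with a short computation. The sketch below converts noncentral moments to cumulants with the recursion $\kappa_n = \mu'_n - \sum_{m=1}^{n-1} \binom{n-1}{m-1} \kappa_m \mu'_{n-m}$, which is a standard consequence of differentiating identity (5.66) and is not stated in the text; the function names and the series cutoff are likewise assumptions.

```python
import math

def poisson_noncentral_moment(k, lam, n_terms=200):
    """Truncated series for the k-th noncentral Poisson moment (k >= 1)."""
    return sum(
        n**k * math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))
        for n in range(n_terms)
    )

def cumulants_from_moments(moments):
    """moments[k-1] holds mu'_k; returns [kappa_1, ..., kappa_K]."""
    kappas = []
    for n in range(1, len(moments) + 1):
        correction = sum(
            math.comb(n - 1, m - 1) * kappas[m - 1] * moments[n - m - 1]
            for m in range(1, n)
        )
        kappas.append(moments[n - 1] - correction)
    return kappas

lam = 2.0
moments = [poisson_noncentral_moment(k, lam) for k in range(1, 5)]
kappas = cumulants_from_moments(moments)   # each entry should be close to lam
```

For $n = 2$ and $n = 3$ the recursion reduces to the expressions for $\kappa_2$ and $\kappa_3$ given earlier in this section.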


The Probability Generating Function 


This is similar to the moment generating function. It is defined as

\[
\psi(t) = E(t^X) = \sum_{n=0}^{\infty} t^{x_n} p(n).
\]

In particular, if $x_n = n$ for $n \geq 0$, then

\[
\psi(t) = \sum_{n=0}^{\infty} t^n p(n), \tag{5.68}
\]

which is a power series in $t$. Within its interval of convergence, this series
represents a continuous function with derivatives of all orders. We note that
$\psi(0) = p(0)$ and that

\[
\left. \frac{1}{k!} \frac{d^k \psi(t)}{dt^k} \right|_{t=0} = p(k), \qquad k = 1, 2, \ldots. \tag{5.69}
\]

Thus, the entire probability distribution of $X$ is completely determined by
$\psi(t)$.


The probability generating function is also useful in determining the
moments of $X$. This is accomplished by using the relation

\[
\begin{aligned}
\left. \frac{d^k \psi(t)}{dt^k} \right|_{t=1}
  &= \sum_{n=k}^{\infty} n(n-1)\cdots(n-k+1)\,p(n) \\
  &= E[X(X-1)\cdots(X-k+1)].
\end{aligned} \tag{5.70}
\]

The quantity on the right-hand side of (5.70) is called the $k$th factorial
moment of $X$, which we denote by $\theta_k$. The noncentral moments of $X$ can be
derived from the $\theta_k$'s. For example,

\[
\begin{aligned}
\mu'_1 &= \theta_1, \\
\mu'_2 &= \theta_2 + \theta_1, \\
\mu'_3 &= \theta_3 + 3\theta_2 + \theta_1, \\
\mu'_4 &= \theta_4 + 6\theta_3 + 7\theta_2 + \theta_1.
\end{aligned}
\]

Obviously, formula (5.70) is valid provided that $t = 1$ belongs to the interval
of convergence of the series in (5.68).

If a closed-form expression is available for the moment generating function,
then a corresponding expression can be obtained for $\psi(t)$ by replacing
$e^t$ with $t$. For example, from formula (5.65), the probability generating
function for the Poisson distribution is given by $\psi(t) = \exp(\lambda t - \lambda)$.
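
The coefficients in the relations between $\mu'_k$ and the $\theta_k$'s listed above (1; 1, 1; 1, 3, 1; 1, 6, 7, 1 when read from $\theta_1$ upward) are Stirling numbers of the second kind. The sketch below is an illustration rather than part of the text: it generates these coefficients and applies them to the Poisson case, where differentiating $\psi(t) = \exp(\lambda t - \lambda)$ $k$ times at $t = 1$ gives $\theta_k = \lambda^k$. The function names are hypothetical.

```python
def stirling2_row(k):
    """Return [S(k,0), ..., S(k,k)], Stirling numbers of the second kind."""
    row = [1]                                   # S(0,0) = 1
    for m in range(1, k + 1):
        new = [0] * (m + 1)
        for j in range(1, m + 1):
            below = row[j] if j < m else 0      # S(m-1, j)
            new[j] = j * below + row[j - 1]     # S(m,j) = j*S(m-1,j) + S(m-1,j-1)
        row = new
    return row

def noncentral_from_factorial(theta):
    """theta[j-1] holds theta_j; returns [mu'_1, ..., mu'_K] via
    mu'_k = sum_{j=1}^{k} S(k,j) * theta_j."""
    out = []
    for k in range(1, len(theta) + 1):
        S = stirling2_row(k)
        out.append(sum(S[j] * theta[j - 1] for j in range(1, k + 1)))
    return out

lam = 2
poisson_factorial = [lam**j for j in range(1, 5)]   # theta_j = lam**j
moments = noncentral_from_factorial(poisson_factorial)
```

With `lam = 2` the first two entries of `moments` are $\lambda = 2$ and $\lambda^2 + \lambda = 6$, matching the Poisson moments obtained earlier in Section 5.6.1.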


5.6.3. Some Limit Theorems 


In Section 3.7 we defined convergence in probability of a sequence of 
random variables. In Section 4.5.1 convergence in distribution of the same 
sequence was introduced. In this section we introduce yet another type of 
convergence. 


Definition 5.6.1. A sequence $\{X_n\}_{n=1}^{\infty}$ of random variables converges in
quadratic mean to a random variable $X$ if

\[
\lim_{n\to\infty} E(X_n - X)^2 = 0.
\]

This convergence is written symbolically as $X_n \xrightarrow{\text{q.m.}} X$. □

Convergence in quadratic mean implies convergence in probability. This
follows directly from applying Markov's inequality: If $X_n \xrightarrow{\text{q.m.}} X$, then
for any $\epsilon > 0$,

\[
P(|X_n - X| \geq \epsilon) \leq \frac{1}{\epsilon^2}\,E(X_n - X)^2 \to 0
\]

as $n \to \infty$. This shows that the sequence $\{X_n\}_{n=1}^{\infty}$ converges in
probability to $X$.


5.6.3.1. The Weak Law of Large Numbers (Khinchine’s Theorem)


Let $\{X_i\}_{i=1}^{\infty}$ be a sequence of independent and identically distributed
random variables with a finite mean $\mu$. Then $\bar{X}_n$ converges in probability to
$\mu$ as $n \to \infty$, where $\bar{X}_n = (1/n)\sum_{i=1}^{n} X_i$ is the sample mean of a
sample of size $n$.

Proof. See, for example, Lindgren (1976, Section 2.5.1) or Rao (1973, 
Section 2c.3). o 


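Definition 5.6.2. A sequence $\{X_n\}_{n=1}^{\infty}$ of random variables converges
strongly, or almost surely, to a random variable $X$, written symbolically as
$X_n \xrightarrow{\text{a.s.}} X$, if for any $\epsilon > 0$,

\[
\lim_{N\to\infty} P\left( \sup_{n \geq N} |X_n - X| > \epsilon \right) = 0. \qquad □
\]


Theorem 5.6.2. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables. Then we
have the following:


1. If $X_n \xrightarrow{\text{a.s.}} c$, where $c$ is constant, then $X_n$ converges in probability to $c$.
2. If $X_n \xrightarrow{\text{q.m.}} c$ and the series $\sum_{n=1}^{\infty} E(X_n - c)^2$ converges, then
   $X_n \xrightarrow{\text{a.s.}} c$. □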


5.6.3.2. The Strong Law of Large Numbers (Kolmogorov’s Theorem) 


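Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent random variables such that
$E(X_n) = \mu_n$ and $\mathrm{Var}(X_n) = \sigma_n^2$, $n = 1, 2, \ldots$. If the series
$\sum_{n=1}^{\infty} \sigma_n^2/n^2$ converges, then $\bar{X}_n \xrightarrow{\text{a.s.}} \bar{\mu}_n$,
where $\bar{\mu}_n = (1/n)\sum_{i=1}^{n} \mu_i$.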


Proof. See Rao (1973, Section 2c.3). oO 


5.6.3.3. The Continuity Theorem for Probability Generating Functions 


See Feller (1968, page 280). 

Suppose that for every $k \geq 1$, the sequence $\{p_k(n)\}_{n=0}^{\infty}$ represents a
discrete probability distribution. Let $\psi_k(t) = \sum_{n=0}^{\infty} t^n p_k(n)$ be the
corresponding probability generating function ($k = 1, 2, \ldots$). In order for a limit

\[
q_n = \lim_{k\to\infty} p_k(n)
\]

to exist for every $n = 0, 1, \ldots$, it is necessary and sufficient that the limit

\[
\psi(t) = \lim_{k\to\infty} \psi_k(t)
\]

exist for every $t$ in the open interval $(0, 1)$. In this case,

\[
\psi(t) = \sum_{n=0}^{\infty} t^n q_n.
\]
n=0 


This theorem implies that a sequence of discrete probability distributions 
converges if and only if the corresponding probability generating functions 
converge. It is important here to point out that the q,’s may not form a 
discrete probability distribution (because they may not sum to 1). The 
function y(t) may not therefore be a probability generating function. 


5.6.4. Power Series and Logarithmic Series Distributions 


The power series distribution, which was introduced by Kosambi (1949), 
represents a family of discrete distributions, such as the binomial, Poisson, 
and negative binomial. Its probability mass function is given by 


\[
p(n) = \frac{a_n \theta^n}{f(\theta)}, \qquad n = 0, 1, 2, \ldots,
\]

where $a_n \geq 0$, $\theta > 0$, and $f(\theta)$ is the function

\[
f(\theta) = \sum_{n=0}^{\infty} a_n \theta^n. \tag{5.71}
\]

This function is defined provided that $\theta$ falls inside the interval of
convergence of the series in (5.71).

For example, for the Poisson distribution, $\theta = \lambda$, where $\lambda$ is the mean,
$a_n = 1/n!$ for $n = 0, 1, 2, \ldots$, and $f(\theta) = e^{\lambda}$. For the negative binomial,
$\theta = 1 - p$ and $a_n = \binom{n+r-1}{n}$, $n = 0, 1, 2, \ldots$, where $n$ = number of
failures, $r$ = number of successes, and $p$ = probability of success on each trial, and
thus

\[
f(\theta) = \sum_{n=0}^{\infty} \binom{n+r-1}{n} (1-p)^n = \frac{1}{p^r}.
\]
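
The definition of the power series family translates directly into code. The following Python sketch is illustrative only: the helper name `power_series_pmf`, the truncation of $f(\theta)$ at `n_terms`, and the parameter values are all assumptions. It builds the probability mass function from the coefficients $a_n$ and $\theta$, and recovers the Poisson and negative binomial cases described above.

```python
import math

def power_series_pmf(a, theta, n_terms=150):
    """Return n -> a(n) * theta**n / f(theta), with f(theta) approximated by a
    truncated sum of the series (5.71). n_terms is an assumed cutoff and theta
    must lie inside the interval of convergence."""
    f_theta = sum(a(n) * theta**n for n in range(n_terms))
    return lambda n: a(n) * theta**n / f_theta

# Poisson: a_n = 1/n!, theta = lambda, f(theta) = e**lambda
lam = 2.5
poisson = power_series_pmf(lambda n: 1.0 / math.factorial(n), lam)

# Negative binomial: a_n = C(n + r - 1, n), theta = 1 - p, f(theta) = 1 / p**r
r, p = 3, 0.4
negbin = power_series_pmf(lambda n: math.comb(n + r - 1, n), 1 - p)
```

Only the coefficient function and $\theta$ change from one member of the family to another; the normalizing constant $f(\theta)$ is computed the same way in every case.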


A special case of the power series distribution is the logarithmic series 
distribution. It was first introduced by Fisher, Corbet, and Williams (1943) 
while studying abundance and diversity for insect trap data. The probability 
mass function for this distribution is

\[
p(n) = -\frac{\theta^n}{n \log(1 - \theta)}, \qquad n = 1, 2, \ldots,
\]

where $0 < \theta < 1$.
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The logarithmic series distribution is useful in the analysis of various kinds 
of data. A description of some of its applications can be found, for example, 
in Johnson and Kotz (1969, Chapter 7). 


5.6.5. Poisson Approximation to Power Series Distributions 


See Pérez-Abreu (1991). 

The Poisson distribution can provide an approximation to the distribution 
of the sum of random variables having power series distributions. This is 
based on the following theorem: 


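Theorem 5.6.3. For each $k \geq 1$, let $X_1, X_2, \ldots, X_k$ be independent
nonnegative integer-valued random variables with a common power series
distribution

\[
p_k(n) = \frac{a_n \theta_k^n}{f(\theta_k)}, \qquad n = 0, 1, \ldots,
\]

where $a_n \geq 0$ ($n = 0, 1, \ldots$) are independent of $k$ and

\[
f(\theta_k) = \sum_{n=0}^{\infty} a_n \theta_k^n, \qquad \theta_k > 0.
\]

Let $a_0 > 0$, $\lambda > 0$ be fixed and $S_k = \sum_{i=1}^{k} X_i$. If $k\theta_k \to \lambda$ as
$k \to \infty$, then

\[
\lim_{k\to\infty} P(S_k = n) = \frac{e^{-\lambda_0}\lambda_0^n}{n!}, \qquad n = 0, 1, \ldots,
\]

where $\lambda_0 = \lambda a_1/a_0$.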


Proof. See Pérez-Abreu (1991, page 43). Oo 


By using this theorem we can obtain the well-known Poisson approxima- 
tion to the binomial and the negative binomial distributions as shown below. 


EXAMPLE 5.6.2 (The Binomial Distribution). For each $k \geq 1$, let
$X_1, \ldots, X_k$ be a sequence of independent Bernoulli random variables with
success probability $p_k$. Let $S_k = \sum_{i=1}^{k} X_i$. Suppose that $kp_k \to \lambda > 0$ as
$k \to \infty$. Then, for each $n = 0, 1, \ldots$,

\[
\begin{aligned}
\lim_{k\to\infty} P(S_k = n) &= \lim_{k\to\infty} \binom{k}{n} p_k^n (1 - p_k)^{k-n} \\
  &= \frac{e^{-\lambda}\lambda^n}{n!}.
\end{aligned}
\]

This follows from the fact that the Bernoulli distribution with success
probability $p_k$ is a power series distribution with $\theta_k = p_k/(1 - p_k)$ and
$f(\theta_k) = 1 + \theta_k$. Since $a_0 = a_1 = 1$, and $k\theta_k \to \lambda$ as $k \to \infty$, we get
from applying Theorem 5.6.3 that

\[
\lim_{k\to\infty} P(S_k = n) = \frac{e^{-\lambda}\lambda^n}{n!}.
\]
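
The convergence in Example 5.6.2 can be observed numerically. In the sketch below, which is only an illustration, the success probability is taken to be exactly $p_k = \lambda/k$ (one convenient choice satisfying $kp_k \to \lambda$), and the largest absolute difference between the binomial and Poisson probabilities over $n = 0, \ldots, 10$ is recorded for increasing $k$. The function names and the range of $n$ are assumptions.

```python
import math

def binomial_pmf(n, k, p):
    """P(S_k = n) for a sum of k independent Bernoulli(p) variables."""
    return math.comb(k, n) * p**n * (1 - p)**(k - n)

def poisson_pmf(n, lam):
    return math.exp(-lam) * lam**n / math.factorial(n)

def max_abs_error(k, lam, n_max=10):
    """Largest |binomial - Poisson| over n = 0..n_max, with p_k = lam / k."""
    p_k = lam / k
    return max(abs(binomial_pmf(n, k, p_k) - poisson_pmf(n, lam))
               for n in range(n_max + 1))

lam = 2.0
errors = [max_abs_error(k, lam) for k in (10, 100, 1000, 10000)]
```

The recorded errors shrink as $k$ grows, in line with the limit stated in the example.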


EXAMPLE 5.6.3 (The Negative Binomial Distribution). We recall that a
random variable $Y$ has the negative binomial distribution if it represents the
number of failures $n$ (in repeated trials) before the $k$th success ($k \geq 1$). Let
$p_k$ denote the probability of success on a single trial. Let $X_1, X_2, \ldots, X_k$ be
random variables defined as

$X_1$ = number of failures occurring before the 1st success,

$X_2$ = number of failures occurring between the 1st success
and the 2nd success,

$\vdots$

$X_k$ = number of failures occurring between the $(k-1)$st
success and the $k$th success.

Such random variables have what is known as the geometric distribution. It is
a special case of the negative binomial distribution with $k = 1$. The common
probability distribution of the $X_i$'s is

\[
P(X_i = n) = p_k (1 - p_k)^n, \qquad n = 0, 1, \ldots; \quad i = 1, 2, \ldots, k.
\]

This is a power series distribution with $a_n = 1$ ($n = 0, 1, \ldots$), $\theta_k = 1 - p_k$, and

\[
f(\theta_k) = \sum_{n=0}^{\infty} (1 - p_k)^n = \frac{1}{1 - (1 - p_k)} = \frac{1}{p_k}.
\]

It is easy to see that $X_1, X_2, \ldots, X_k$ are independent and that
$Y = S_k = \sum_{i=1}^{k} X_i$.

Let us now assume that $k(1 - p_k) \to \lambda > 0$ as $k \to \infty$. Then from Theorem
5.6.3 we obtain the following result:

\[
\lim_{k\to\infty} P(S_k = n) = \frac{e^{-\lambda}\lambda^n}{n!}, \qquad n = 0, 1, \ldots.
\]


5.6.6. A Ridge Regression Application 


Consider the linear model 


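\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},
\]

where $\mathbf{y}$ is a vector of $n$ response values, $\mathbf{X}$ is an $n \times p$ matrix of rank $p$,
$\boldsymbol{\beta}$ is a vector of $p$ unknown parameters, and $\boldsymbol{\epsilon}$ is a random error vector
such that $E(\boldsymbol{\epsilon}) = \mathbf{0}$ and $\mathrm{Var}(\boldsymbol{\epsilon}) = \sigma^2 \mathbf{I}_n$. All variables in this model
are corrected for their means and scaled to unit length, so that $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{y}$
are in correlation form.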

We recall from Section 2.4.2 that if the columns of $\mathbf{X}$ are multicollinear,
then the least-squares estimator of $\boldsymbol{\beta}$, namely
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, is an unreliable estimator due to large variances
associated with its elements. There are several methods that can be used to
combat multicollinearity. A review of such methods can be found in Ofir and
Khuri (1986). Ridge regression is one of the most popular of these methods.
This method, which was developed by Hoerl and Kennard (1970a, b), is based
on adding a positive constant $k$ to the diagonal elements of $\mathbf{X}'\mathbf{X}$. This leads
to a biased estimator $\boldsymbol{\beta}^*$ of $\boldsymbol{\beta}$ called the ridge regression estimator and is
given by

\[
\boldsymbol{\beta}^* = (\mathbf{X}'\mathbf{X} + k\mathbf{I}_p)^{-1}\mathbf{X}'\mathbf{y}.
\]

The elements of $\boldsymbol{\beta}^*$ can have substantially smaller variances than the
corresponding elements of $\hat{\boldsymbol{\beta}}$ (see, for example, Montgomery and Peck, 1982,
Section 8.5.3).

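Draper and Herzberg (1987) showed that the ridge regression residual
sum of squares can be represented as a power series in $k$. More specifically,
consider the vector of predicted responses,

\[
\begin{aligned}
\hat{\mathbf{y}}_k &= \mathbf{X}\boldsymbol{\beta}^* \\
  &= \mathbf{X}(\mathbf{X}'\mathbf{X} + k\mathbf{I}_p)^{-1}\mathbf{X}'\mathbf{y},
\end{aligned} \tag{5.72}
\]

which is based on using $\boldsymbol{\beta}^*$. Formula (5.72) can be written as

\[
\hat{\mathbf{y}}_k = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}[\mathbf{I}_p + k(\mathbf{X}'\mathbf{X})^{-1}]^{-1}\mathbf{X}'\mathbf{y}. \tag{5.73}
\]

From Theorem 5.5.2, if all the eigenvalues of $k(\mathbf{X}'\mathbf{X})^{-1}$ are less than one in
absolute value, then

\[
[\mathbf{I}_p + k(\mathbf{X}'\mathbf{X})^{-1}]^{-1} = \sum_{i=0}^{\infty} (-1)^i k^i (\mathbf{X}'\mathbf{X})^{-i}. \tag{5.74}
\]

From (5.73) and (5.74) we get

\[
\hat{\mathbf{y}}_k = (\mathbf{H}_1 - k\mathbf{H}_2 + k^2\mathbf{H}_3 - k^3\mathbf{H}_4 + \cdots)\mathbf{y},
\]

where $\mathbf{H}_i = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-i}\mathbf{X}'$, $i \geq 1$. Thus the ridge regression residual sum of
squares, which is the sum of squares of deviations of the elements of $\mathbf{y}$ from
the corresponding elements of $\hat{\mathbf{y}}_k$, is

\[
(\mathbf{y} - \hat{\mathbf{y}}_k)'(\mathbf{y} - \hat{\mathbf{y}}_k) = \mathbf{y}'\mathbf{Q}\mathbf{y},
\]

where

\[
\mathbf{Q} = (\mathbf{I}_n - \mathbf{H}_1 + k\mathbf{H}_2 - k^2\mathbf{H}_3 + k^3\mathbf{H}_4 - \cdots)^2.
\]

It can be shown (see Exercise 5.32) that

\[
\mathbf{y}'\mathbf{Q}\mathbf{y} = SS_E + \sum_{i=3}^{\infty} (i-2)(-k)^{i-1} S_i, \tag{5.75}
\]

where $SS_E$ is the usual least-squares residual sum of squares, which can be
obtained when $k = 0$, that is,

\[
SS_E = \mathbf{y}'[\mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y},
\]

and $S_i = \mathbf{y}'\mathbf{H}_i\mathbf{y}$, $i \geq 3$. The terms to the right of (5.75), other than $SS_E$, are
bias sums of squares induced by the presence of a nonzero $k$. Draper and
Herzberg (1987) demonstrated by means of an example that the series in
(5.75) may diverge or else converge very slowly, depending on the value of $k$.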
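
Formula (5.75) can be verified numerically on a small example. The sketch below is an illustration under stated assumptions, not the computation of Draper and Herzberg: the data are made up, there are only two predictors (centered, scaled to unit length, with correlation 0.5), the response is centered but not rescaled, and $k = 0.1$ is chosen so that the eigenvalues of $k(\mathbf{X}'\mathbf{X})^{-1}$ are below one. It uses the fact that $S_i = \mathbf{y}'\mathbf{H}_i\mathbf{y} = \mathbf{z}'(\mathbf{X}'\mathbf{X})^{-i}\mathbf{z}$ with $\mathbf{z} = \mathbf{X}'\mathbf{y}$.

```python
import math

def inv2(A):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# Illustrative data (assumed): two centered, unit-length columns, correlation 0.5.
x1 = [0.5, 0.5, -0.5, -0.5]
w = [0.5, -0.5, 0.5, -0.5]
x2 = [0.5 * u + math.sqrt(0.75) * v for u, v in zip(x1, w)]
y = [1.0, 0.2, -0.3, -0.9]
k = 0.1

A = [[dot(x1, x1), dot(x1, x2)], [dot(x2, x1), dot(x2, x2)]]   # X'X
z = [dot(x1, y), dot(x2, y)]                                    # X'y

# Direct ridge fit: beta* = (X'X + k I)^{-1} X'y, then the residual sum of squares.
beta = matvec(inv2([[A[0][0] + k, A[0][1]], [A[1][0], A[1][1] + k]]), z)
fitted = [beta[0] * u + beta[1] * v for u, v in zip(x1, x2)]
rss_direct = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

# Series (5.75): SS_E + sum_{i>=3} (i-2)(-k)^(i-1) S_i, truncated at i = 79.
A_inv = inv2(A)
ss_e = dot(y, y) - dot(z, matvec(A_inv, z))
v = matvec(A_inv, matvec(A_inv, z))           # (X'X)^{-2} z
rss_series = ss_e
for i in range(3, 80):
    v = matvec(A_inv, v)                      # now (X'X)^{-i} z
    rss_series += (i - 2) * (-k) ** (i - 1) * dot(z, v)
```

Here the eigenvalues of $\mathbf{X}'\mathbf{X}$ are 0.5 and 1.5, so the series converges quickly for $k = 0.1$. Choosing $k$ close to 0.5 slows the convergence markedly, and for $k \geq 0.5$ the condition of Theorem 5.5.2 fails.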
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EXERCISES 
In Mathematics 


5.1. Suppose that $\{a_n\}_{n=1}^{\infty}$ is a bounded sequence of positive terms.
(a) Define $b_n = \max\{a_1, a_2, \ldots, a_n\}$, $n = 1, 2, \ldots$. Show that the
sequence $\{b_n\}_{n=1}^{\infty}$ converges, and identify its limit.
(b) Suppose further that $a_n \to c$ as $n \to \infty$, where $c > 0$. Show that
$c_n \to c$, where $\{c_n\}_{n=1}^{\infty}$ is the sequence of geometric means,
$c_n = (\prod_{i=1}^{n} a_i)^{1/n}$.


5.2. Suppose that $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$ are any two Cauchy sequences. Let
$d_n = |a_n - b_n|$, $n = 1, 2, \ldots$. Show that the sequence $\{d_n\}_{n=1}^{\infty}$ converges.


5.3. Prove Theorem 5.1.3. 


5.4. Show that if $\{a_n\}_{n=1}^{\infty}$ is a bounded sequence, then the set $E$ of all its
subsequential limits is also bounded.


5.5. Suppose that $a_n \to c$ as $n \to \infty$ and that $\{\alpha_i\}_{i=1}^{\infty}$ is a sequence of
positive terms for which $\sum_{i=1}^{n} \alpha_i \to \infty$ as $n \to \infty$.
(a) Show that

\[
\frac{\sum_{i=1}^{n} \alpha_i a_i}{\sum_{i=1}^{n} \alpha_i} \to c \qquad \text{as } n \to \infty.
\]

In particular, if $\alpha_i = 1$ for all $i$, then

\[
\frac{1}{n} \sum_{i=1}^{n} a_i \to c \qquad \text{as } n \to \infty.
\]

(b) Show that the converse of the special case in (a) does not always
hold by giving a counterexample of a sequence $\{a_n\}_{n=1}^{\infty}$ that does
not converge, yet $(\sum_{i=1}^{n} a_i)/n$ converges as $n \to \infty$.


5.6. Let $\{a_n\}_{n=1}^{\infty}$ be a sequence of positive terms such that

\[
\frac{a_{n+1}}{a_n} \to b \qquad \text{as } n \to \infty,
\]

where $0 < b < 1$. Show that there exist constants $c$ and $r$ such that
$0 < r < 1$ and $c > 0$ for which $a_n < cr^n$ for sufficiently large values of $n$.


5.7. Suppose that we have the sequence $\{a_n\}_{n=1}^{\infty}$, where $a_1 = 1$ and

\[
a_{n+1} = \frac{a_n(3b + a_n^2)}{3a_n^2 + b}, \qquad b > 0, \quad n = 1, 2, \ldots.
\]

Show that the sequence converges, and find its limit.


5.8. Show that the sequence $\{a_n\}_{n=1}^{\infty}$ converges, and find its limit, where
$a_1 = 1$ and

\[
a_{n+1} = (2 + a_n)^{1/2}, \qquad n = 1, 2, \ldots.
\]


5.9. Let $\{a_n\}_{n=1}^{\infty}$ be a sequence and $s_n = \sum_{i=1}^{n} a_i$.
(a) Show that $\limsup_{n\to\infty} (s_n/n) \leq \limsup_{n\to\infty} a_n$.
(b) If $s_n/n$ converges as $n \to \infty$, then show that $a_n/n \to 0$ as $n \to \infty$.


5.10. Show that the sequence $\{a_n\}_{n=1}^{\infty}$, where $a_n = \sum_{i=1}^{n} (1/i)$, is not a Cauchy
sequence and is therefore divergent.


5.11. Suppose that the sequence $\{a_n\}_{n=1}^{\infty}$ satisfies the following condition:
There is an $r$, $0 < r < 1$, such that

\[
|a_{n+1} - a_n| < br^n, \qquad n = 1, 2, \ldots,
\]

where $b$ is a positive constant. Show that this sequence converges.


5.12. Show that if $a_n \geq 0$ for all $n$, then $\sum_{n=1}^{\infty} a_n$ converges if and only if
$\{s_n\}_{n=1}^{\infty}$ is a bounded sequence, where $s_n$ is the $n$th partial sum of the
series.


5.13. Show that the series $\sum_{n=1}^{\infty} [1/((3n-1)(3n+2))]$ converges to $\frac{1}{6}$.


5.14. Show that the series $\sum_{n=1}^{\infty} (n^{1/n} - 1)^p$ is divergent for $p \leq 1$.


5.15. Let $\sum_{n=1}^{\infty} a_n$ be a divergent series of positive terms.
(a) If $\{a_n\}_{n=1}^{\infty}$ is a bounded sequence, then show that
$\sum_{n=1}^{\infty} [a_n/(1 + a_n)]$ diverges.
(b) Show that (a) is true even if $\{a_n\}_{n=1}^{\infty}$ is not a bounded sequence.


5.16. Let $\sum_{n=1}^{\infty} a_n$ be a divergent series of positive terms. Show that

\[
\frac{a_n}{s_n^2} \leq \frac{1}{s_{n-1}} - \frac{1}{s_n}, \qquad n = 2, 3, \ldots,
\]

where $s_n$ is the $n$th partial sum of the series; then deduce that
$\sum_{n=1}^{\infty} (a_n/s_n^2)$ converges.


5.17. Let $\sum_{n=1}^{\infty} a_n$ be a convergent series of positive terms. Let
$r_n = \sum_{i=n}^{\infty} a_i$. Show that for $m < n$,

\[
\sum_{i=m}^{n} \frac{a_i}{r_i} > 1 - \frac{r_n}{r_m},
\]

and deduce that $\sum_{n=1}^{\infty} (a_n/r_n)$ diverges.


5.18. Given the two series $\sum_{n=1}^{\infty} (1/n)$, $\sum_{n=1}^{\infty} (1/n^2)$. Show that


5.19. Test for convergence of the series $\sum_{n=1}^{\infty} a_n$, where


(a) a, = (nl 1)", 
log(1 +n) 
Se em 
log(1 +e”) 
1X3X5X-=xX(2n-1) 1 
(c) a, = 
2X4X6X X2n 2n+1 


(d) a,=yn+vn -vn, 


2 


—1)"4" 
(e) = 
1 

(f) a, = sin [nr zl. 


ic 
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5.20. Determine the values of x for which each of the following series 


converges uniformly:


(a) Pe 
(b) ro 

© PU 
o > 


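5.21. Consider the series

\[
\sum_{n=1}^{\infty} a_n = \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{n}.
\]

Let $\sum_{n=1}^{\infty} b_n$ be a certain rearrangement of $\sum_{n=1}^{\infty} a_n$ given by

\[
1 + \frac{1}{3} - \frac{1}{2} + \frac{1}{5} + \frac{1}{7} - \frac{1}{4} + \frac{1}{9} + \frac{1}{11} - \frac{1}{6} + \cdots,
\]

where two positive terms are followed by one negative. Show that the
sum of the original series is less than $\frac{5}{6}$, whereas that of the rearranged
series (which is convergent) exceeds $\frac{5}{6}$.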


5.22. Consider Cauchy's product of $\sum_{n=0}^{\infty} a_n$ with itself, where

\[
a_n = \frac{(-1)^n}{\sqrt{n+1}}, \qquad n = 0, 1, 2, \ldots.
\]

Show that this product is divergent. [Hint: Show that the $n$th term of
this product does not go to zero as $n \to \infty$.]


5.23. Consider the sequence of functions $\{f_n(x)\}_{n=1}^{\infty}$, where for $n = 1, 2, \ldots$

\[
f_n(x) = \frac{nx}{1 + nx^2}, \qquad x \geq 0.
\]

Find the limit of this sequence, and determine whether or not the
convergence is uniform on $[0, \infty)$.


5.24. Consider the series $\sum_{n=1}^{\infty} (1/n^x)$.

(a) Show that this series converges uniformly on $[1 + \delta, \infty)$, where $\delta$ is
any positive number. [Note: The function represented by this series
is known as Riemann's $\zeta$-function and is denoted by $\zeta(x)$.]

(b) Is $\zeta(x)$ differentiable on $[1 + \delta, \infty)$? If so, give a series expansion
for $\zeta'(x)$.


In Statistics 


5.25. Prove formulas (5.56) and (5.57).


5.26. Find a series expansion for the moment generating function of the
negative binomial distribution. For what values of $t$ does this series
converge uniformly? In this case, can formula (5.63) be applied to
obtain an expression for the $k$th noncentral moment ($k = 1, 2, \ldots$) of
this distribution? Why or why not?


5.27. Find the first three cumulants of the negative binomial distribution.


5.28. Show that the moments $\mu'_n$ ($n = 1, 2, \ldots$) of a random variable $X$
determine the cumulative distribution function of $X$ uniquely if

\[
\limsup_{n\to\infty} \frac{1}{n}\,|\mu'_n|^{1/n}
\]

is finite. [Hint: Use the fact that $n! \sim \sqrt{2\pi}\,n^{n+1/2}e^{-n}$ as $n \to \infty$.]


5.29. Find the moment generating function of the logarithmic series
distribution, and deduce that the mean and variance of this distribution are
given by

\[
\mu = \frac{\alpha\theta}{1 - \theta}, \qquad
\sigma^2 = \mu\left( \frac{1}{1 - \theta} - \mu \right),
\]

where $\alpha = -1/\log(1 - \theta)$.


5.30. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of binomial random variables where the
probability mass function of $X_n$ ($n = 1, 2, \ldots$) is given by

\[
p_n(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, 1, 2, \ldots, n,
\]

where $0 < p < 1$. Further, let the random variable $Y_n$ be defined as

\[
Y_n = \frac{X_n}{n} - p.
\]

(a) Show that $E(X_n) = np$ and $\mathrm{Var}(X_n) = np(1 - p)$.
(b) Apply Chebyshev's inequality to show that

\[
P(|Y_n| \geq \epsilon) \leq \frac{p(1 - p)}{n\epsilon^2},
\]

where $\epsilon > 0$.

(c) Deduce from (b) that $Y_n$ converges in probability to zero. [Note:
This result is known as Bernoulli's law of large numbers.]


5.31. Let $X_1, X_2, \ldots, X_n$ be a sequence of independent Bernoulli random
variables with success probability $p_n$. Let $S_n = \sum_{i=1}^{n} X_i$. Suppose that
$np_n \to \mu > 0$ as $n \to \infty$.

(a) Give an expression for $\phi_n(t)$, the moment generating function of
$S_n$.

(b) Show that

\[
\lim_{n\to\infty} \phi_n(t) = \exp(\mu e^t - \mu),
\]

which is the moment generating function of a Poisson distribution
with mean $\mu$.


5.32. Prove formula (5.75). 


CHAPTER 6 


Integration 


The origin of integral calculus can be traced back to the ancient Greeks. 
They were motivated by the need to measure the length of a curve, the area 
of a surface, or the volume of a solid. Archimedes used techniques very 
similar to actual integration to determine the length of a segment of a curve. 
Democritus (410 B.C.) had the insight to consider that a cone was made up of 
infinitely many plane cross sections parallel to the base. 

The theory of integration received very little stimulus after Archimedes’s 
remarkable achievements. It was not until the beginning of the seventeenth 
century that the interest in Archimedes’s ideas began to develop. Johann 
Kepler (1571-1630) was the first among European mathematicians to de- 
velop the ideas of infinitesimals in connection with integration. The use of 
the term “integral” is due to the Swiss mathematician Johann Bernoulli 
(1667-1748). 

In the present chapter we shall study integration of real-valued functions 
of a single variable x according to the concepts put forth by the German 
mathematician Georg Friedrich Riemann (1826-1866). He was the first to 
establish a rigorous analytical foundation for integration, based on the older 
geometric approach. 


6.1. SOME BASIC DEFINITIONS 


Let $f(x)$ be a function defined and bounded on a finite interval $[a, b]$.
Suppose that this interval is partitioned into a finite number of subintervals
by a set of points $P = \{x_0, x_1, \ldots, x_n\}$ such that
$a = x_0 < x_1 < x_2 < \cdots < x_n = b$. This set is called a partition of $[a, b]$. Let
$\Delta x_i = x_i - x_{i-1}$ ($i = 1, 2, \ldots, n$), and $\Delta_p$ be the largest of
$\Delta x_1, \Delta x_2, \ldots, \Delta x_n$. This value is called the norm of $P$.
Consider the sum

\[
S(P, f) = \sum_{i=1}^{n} f(t_i)\,\Delta x_i,
\]

where $t_i$ is a point in the subinterval $[x_{i-1}, x_i]$, $i = 1, 2, \ldots, n$.

The function $f(x)$ is said to be Riemann integrable on $[a, b]$ if a number
$A$ exists with the following property: For any given $\epsilon > 0$ there exists a
number $\delta > 0$ such that

\[
|A - S(P, f)| < \epsilon
\]

for any partition $P$ of $[a, b]$ with a norm $\Delta_p < \delta$, and for any choice of the
point $t_i$ in $[x_{i-1}, x_i]$, $i = 1, 2, \ldots, n$. The number $A$ is called the Riemann
integral of $f(x)$ on $[a, b]$ and is denoted by $\int_a^b f(x)\,dx$. The integration symbol
$\int$ was first used by the German mathematician Gottfried Wilhelm Leibniz
(1646-1716) to represent a sum (it was derived from the first letter of the
Latin word summa, which means a sum).
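
The definition can be made concrete with a short computation. The following Python sketch is only an illustration: the function $f(x) = x^2$ on $[0, 1]$, the uniform partition, the random choice of the points $t_i$, and all function names are assumptions. It forms the sum $S(P, f)$ and compares it with the known value $\int_0^1 x^2\,dx = 1/3$.

```python
import random

def uniform_partition(a, b, n):
    """Partition of [a, b] into n subintervals of equal length."""
    return [a + (b - a) * i / n for i in range(n + 1)]

def riemann_sum(f, partition, tags):
    """S(P, f) = sum of f(t_i) * (x_i - x_{i-1}), one tag t_i per subinterval."""
    return sum(f(t) * (x1 - x0)
               for x0, x1, t in zip(partition[:-1], partition[1:], tags))

random.seed(0)                       # fixed seed so the run is reproducible
f = lambda x: x * x
P = uniform_partition(0.0, 1.0, 1000)
tags = [random.uniform(x0, x1) for x0, x1 in zip(P[:-1], P[1:])]
approx = riemann_sum(f, P, tags)     # compare with 1/3
```

Whatever points $t_i$ are chosen, the sum stays within the norm of this partition (here 0.001) of 1/3, because $f$ is increasing on $[0, 1]$ and the upper and lower sums therefore differ by exactly $[f(1) - f(0)] \times 0.001$.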


6.2. THE EXISTENCE OF THE RIEMANN INTEGRAL 


In order to investigate the existence of the Riemann integral, we shall need 
the following theorem: 


Theorem 6.2.1. Let $f(x)$ be a bounded function on a finite interval,
$[a, b]$. For every partition $P = \{x_0, x_1, \ldots, x_n\}$ of $[a, b]$, let $m_i$ and $M_i$ be,
respectively, the infimum and supremum of $f(x)$ on $[x_{i-1}, x_i]$, $i = 1, 2, \ldots, n$.
If, for a given $\epsilon > 0$, there exists a $\delta > 0$ such that

\[
US_P(f) - LS_P(f) < \epsilon \tag{6.1}
\]

whenever $\Delta_p < \delta$, where $\Delta_p$ is the norm of $P$, and

\[
LS_P(f) = \sum_{i=1}^{n} m_i\,\Delta x_i, \qquad
US_P(f) = \sum_{i=1}^{n} M_i\,\Delta x_i,
\]

then $f(x)$ is Riemann integrable on $[a, b]$. Conversely, if $f(x)$ is Riemann
integrable, then inequality (6.1) holds for any partition $P$ such that $\Delta_p < \delta$.
[The sums, $LS_P(f)$ and $US_P(f)$, are called the lower sum and upper sum,
respectively, of $f(x)$ with respect to the partition $P$.]


In order to prove Theorem 6.2.1 we need the following lemmas: 


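Lemma 6.2.1. Let $P$ and $P'$ be two partitions of $[a, b]$ such that $P' \supset P$
($P'$ is called a refinement of $P$ and is constructed by adding partition points
between those that belong to $P$). Then

\[
US_{P'}(f) \leq US_P(f), \qquad
LS_{P'}(f) \geq LS_P(f).
\]

Proof. Let $P = \{x_0, x_1, \ldots, x_n\}$. By the nature of the partition $P'$, the $i$th
subinterval $\Delta x_i = x_i - x_{i-1}$ is divided into $k_i$ parts
$\Delta_{x_i}^{(1)}, \Delta_{x_i}^{(2)}, \ldots, \Delta_{x_i}^{(k_i)}$, where $k_i \geq 1$, $i = 1, 2, \ldots, n$. If
$m_i^{(j)}$ and $M_i^{(j)}$ denote, respectively, the infimum and supremum of $f(x)$ on
$\Delta_{x_i}^{(j)}$, then $m_i \leq m_i^{(j)} \leq M_i^{(j)} \leq M_i$ for $j = 1, 2, \ldots, k_i$;
$i = 1, 2, \ldots, n$, where $m_i$ and $M_i$ are the infimum and supremum of $f(x)$ on
$[x_{i-1}, x_i]$, respectively. It follows that

\[
LS_P(f) = \sum_{i=1}^{n} m_i\,\Delta x_i
  \leq \sum_{i=1}^{n} \sum_{j=1}^{k_i} m_i^{(j)}\,\Delta_{x_i}^{(j)} = LS_{P'}(f),
\]

\[
US_{P'}(f) = \sum_{i=1}^{n} \sum_{j=1}^{k_i} M_i^{(j)}\,\Delta_{x_i}^{(j)}
  \leq \sum_{i=1}^{n} M_i\,\Delta x_i = US_P(f). \qquad □
\]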
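
The effect of refinement in Lemma 6.2.1 can be seen on a small example. The sketch below is illustrative and makes a simplifying assumption: the function is increasing on the interval, so its infimum and supremum on each subinterval are attained at the left and right endpoints. The particular partitions and names are hypothetical.

```python
def lower_upper_sums(f, partition):
    """Lower and upper sums of an increasing function f over a partition.

    For an increasing f, the infimum on [x0, x1] is f(x0) and the supremum is f(x1).
    """
    pairs = list(zip(partition[:-1], partition[1:]))
    lower = sum(f(x0) * (x1 - x0) for x0, x1 in pairs)
    upper = sum(f(x1) * (x1 - x0) for x0, x1 in pairs)
    return lower, upper

f = lambda x: x * x                           # increasing on [0, 1]
P = [0.0, 0.25, 0.5, 1.0]
P_refined = sorted(set(P) | {0.125, 0.75})    # P' contains every point of P
ls_p, us_p = lower_upper_sums(f, P)
ls_r, us_r = lower_upper_sums(f, P_refined)
```

Adding the two points raises the lower sum and lowers the upper sum, and both pairs still bracket the value of the integral of $x^2$ over $[0, 1]$, namely 1/3.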


Lemma 6.2.2. Let $P$ and $P'$ be any two partitions of $[a, b]$. Then
$LS_P(f) \leq US_{P'}(f)$.


Proof. Let $P'' = P \cup P'$. The partition $P''$ is a refinement of both $P$ and
$P'$. Then, by Lemma 6.2.1,

\[
LS_P(f) \leq LS_{P''}(f) \leq US_{P''}(f) \leq US_{P'}(f). \qquad □
\]


Proof of Theorem 6.2.1 


Let e€>0 be given. Suppose that inequality (6.1) holds for any partition P 
whose norm A, is less than ô. Let S(P, f)=L7_, f(t,) Ax; where t; is a 
point in [x,;_,,x x, i=1,2,...,n. By the definition of LSp(f) and USp(f) we 
can write 


LSp(f) <S(P,f) < USp(f). (6.2) 


Let m and M be the infimum and supremum, respectively, of f(x) on [a, b]; 
then 


m(b—a) <LSp(f) <USp(f) <M(b—a). (6.3) 


Let us consider two sets of lower and upper sums of f(x) with respect to
partitions P, P', P'', ... such that $P \subset P' \subset P'' \subset \cdots$. Then, by Lemma 6.2.1,
the set of upper sums is decreasing, and the set of lower sums is increasing.
Furthermore, because of (6.3), the set of upper sums is bounded from below
by m(b - a), and the set of lower sums is bounded from above by M(b - a).
Hence, the infimum of $US_P(f)$ and the supremum of $LS_P(f)$ with respect to
P do exist (see Theorem 1.5.1).
From Lemma 6.2.2 it is easy to deduce that


$$\sup_P LS_P(f) \le \inf_P US_P(f).$$

Now, suppose that for the given $\varepsilon > 0$ there exists a $\delta > 0$ such that

$$US_P(f) - LS_P(f) < \varepsilon \qquad (6.4)$$

for any partition whose norm $\Delta_P$ is less than $\delta$. We have that

$$LS_P(f) \le \sup_P LS_P(f) \le \inf_P US_P(f) \le US_P(f). \qquad (6.5)$$

Hence,

$$\inf_P US_P(f) - \sup_P LS_P(f) < \varepsilon.$$

Since $\varepsilon > 0$ is arbitrary, we conclude that if (6.1) is satisfied, then

$$\inf_P US_P(f) = \sup_P LS_P(f). \qquad (6.6)$$

Furthermore, from (6.2), (6.4), and (6.5) we obtain

$$|S(P, f) - A| < \varepsilon,$$

where A is the common value of $\inf_P US_P(f)$ and $\sup_P LS_P(f)$. This proves
that A is the Riemann integral of f(x) on [a, b].

Let us now show that the converse of the theorem is true, that is, if f(x) is 
Riemann integrable on [a, b], then inequality (6.1) holds. 

If f(x) is Riemann integrable, then for a given $\varepsilon > 0$ there exists a $\delta > 0$
such that

$$\left| \sum_{i=1}^{n} f(t_i)\,\Delta x_i - A \right| < \frac{\varepsilon}{3} \qquad (6.7)$$

and

$$\left| \sum_{i=1}^{n} f(t_i')\,\Delta x_i - A \right| < \frac{\varepsilon}{3} \qquad (6.8)$$

for any partition $P = \{x_0, x_1, \ldots, x_n\}$ of [a,b] with a norm $\Delta_P < \delta$, and any
choices of $t_i, t_i'$ in $[x_{i-1}, x_i]$, i = 1,2,...,n, where $A = \int_a^b f(x)\,dx$. From (6.7)
and (6.8) we then obtain

$$\left| \sum_{i=1}^{n} [f(t_i) - f(t_i')]\,\Delta x_i \right| < \frac{2\varepsilon}{3}.$$

Now, $M_i - m_i$ is the supremum of f(x) - f(x') for x, x' in $[x_{i-1}, x_i]$,
i = 1,2,...,n. It follows that for a given $\eta > 0$ we can choose $t_i, t_i'$ in $[x_{i-1}, x_i]$ so
that

$$f(t_i) - f(t_i') > M_i - m_i - \eta, \qquad i = 1,2,\ldots,n,$$

for otherwise $M_i - m_i - \eta$ would be an upper bound for f(x) - f(x') for all
x, x' in $[x_{i-1}, x_i]$, which is a contradiction. In particular, if $\eta = \varepsilon/[3(b - a)]$,
then we can find $t_i, t_i'$ in $[x_{i-1}, x_i]$ such that

$$\begin{aligned}
US_P(f) - LS_P(f) &= \sum_{i=1}^{n} (M_i - m_i)\,\Delta x_i \\
&< \sum_{i=1}^{n} [f(t_i) - f(t_i')]\,\Delta x_i + \eta(b - a) \\
&< \varepsilon.
\end{aligned}$$

This proves the validity of inequality (6.1). □


Corollary 6.2.1. Let f(x) be a bounded function on [a,b]. Then f(x) is
Riemann integrable on [a, b] if and only if $\inf_P US_P(f) = \sup_P LS_P(f)$, where
$LS_P(f)$ and $US_P(f)$ are, respectively, the lower and upper sums of f(x) with
respect to a partition P of [a, b].


Proof. See Exercise 6.1. □


EXAMPLE 6.2.1. Let f(x): [0,1] → R be the function $f(x) = x^2$. Then
f(x) is Riemann integrable on [0,1]. To show this, let $P = \{x_0, x_1, \ldots, x_n\}$ be
any partition of [0,1], where $x_0 = 0$, $x_n = 1$. Then

$$LS_P(f) = \sum_{i=1}^{n} x_{i-1}^2 \,\Delta x_i,$$

$$US_P(f) = \sum_{i=1}^{n} x_i^2 \,\Delta x_i.$$

Hence,

$$US_P(f) - LS_P(f) = \sum_{i=1}^{n} (x_i^2 - x_{i-1}^2)\,\Delta x_i \le \Delta_P \sum_{i=1}^{n} (x_i^2 - x_{i-1}^2),$$

where $\Delta_P$ is the norm of P. But the last sum telescopes:

$$\sum_{i=1}^{n} (x_i^2 - x_{i-1}^2) = x_n^2 - x_0^2 = 1.$$

Thus

$$US_P(f) - LS_P(f) \le \Delta_P.$$

It follows that for a given $\varepsilon > 0$ we can choose $\delta = \varepsilon$ such that for any
partition P whose norm $\Delta_P$ is less than $\delta$,

$$US_P(f) - LS_P(f) < \varepsilon.$$

By Theorem 6.2.1, $f(x) = x^2$ is Riemann integrable on [0, 1].
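
The bound in this example, namely that the gap between the upper and lower sums never exceeds the norm of the partition, can be spot-checked on randomly generated partitions. This is only an illustrative sketch that assumes $f(x) = x^2$ on [0, 1]; the helper `gap_and_norm` and the fixed seed are choices made here, not part of the text.

```python
import random

def gap_and_norm(partition):
    """Return (US_P(f) - LS_P(f), norm of P) for f(x) = x**2 on [0, 1]."""
    pairs = list(zip(partition, partition[1:]))
    # On each subinterval, sup - inf of x**2 is right**2 - left**2.
    gap = sum((right * right - left * left) * (right - left) for left, right in pairs)
    norm = max(right - left for left, right in pairs)
    return gap, norm

random.seed(0)  # fixed seed so that repeated runs use the same partitions
for n in (5, 50, 500):
    interior = sorted(random.random() for _ in range(n))
    P = [0.0] + interior + [1.0]
    gap, norm = gap_and_norm(P)
    print(n, gap, norm, gap <= norm)
```

As the number of partition points grows, the norm typically shrinks and the gap shrinks with it, which is the content of the choice $\delta = \varepsilon$.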


EXAMPLE 6.2.2. Consider the function f(x): [0,1] → R such that f(x) = 0
if x is a rational number and f(x) = 1 if x is irrational. Since every subinterval
of [0, 1] contains both rational and irrational numbers, then for any partition
$P = \{x_0, x_1, \ldots, x_n\}$ of [0,1] we have

$$US_P(f) = \sum_{i=1}^{n} M_i \,\Delta x_i = \sum_{i=1}^{n} \Delta x_i = 1,$$

$$LS_P(f) = \sum_{i=1}^{n} m_i \,\Delta x_i = \sum_{i=1}^{n} 0 \cdot \Delta x_i = 0.$$

It follows that

$$\inf_P US_P(f) = 1 \quad \text{and} \quad \sup_P LS_P(f) = 0.$$

By Corollary 6.2.1, f(x) is not Riemann integrable on [0, 1].


6.3. SOME CLASSES OF FUNCTIONS THAT ARE 
RIEMANN INTEGRABLE 


There are certain classes of functions that are Riemann integrable. Identify- 
ing a given function as a member of such a class can facilitate the determina- 
tion of its Riemann integrability. Some of these classes of functions include: 
(i) continuous functions; (ii) monotone functions; (iii) functions of bounded 
variation. 


Theorem 6.3.1. If f(x) is continuous on [a,b], then it is Riemann 
integrable there. 


Proof. Since f(x) is continuous on a closed and bounded interval, then by
Theorem 3.4.6 it must be uniformly continuous on [a, b]. Consequently, for a
given $\varepsilon > 0$ there exists a $\delta > 0$ that depends only on $\varepsilon$ such that for any
$x_1, x_2$ in [a,b] we have

$$|f(x_1) - f(x_2)| < \frac{\varepsilon}{b - a}$$

if $|x_1 - x_2| < \delta$. Let $P = \{x_0, x_1, \ldots, x_n\}$ be a partition of [a, b] with a norm
$\Delta_P < \delta$. Then

$$US_P(f) - LS_P(f) = \sum_{i=1}^{n} (M_i - m_i)\,\Delta x_i,$$

where $m_i$ and $M_i$ are, respectively, the infimum and supremum of f(x) on
$[x_{i-1}, x_i]$, i = 1,2,...,n. By Corollary 3.4.1 there exist points $\xi_i, \eta_i$ in $[x_{i-1}, x_i]$
such that $m_i = f(\xi_i)$, $M_i = f(\eta_i)$, i = 1,2,...,n. Since $|\eta_i - \xi_i| \le \Delta_P < \delta$ for
i = 1,2,...,n, then

$$US_P(f) - LS_P(f) = \sum_{i=1}^{n} [f(\eta_i) - f(\xi_i)]\,\Delta x_i < \frac{\varepsilon}{b - a} \sum_{i=1}^{n} \Delta x_i = \varepsilon.$$

By Theorem 6.2.1 we conclude that f(x) is Riemann integrable on [a, b]. □


It should be noted that continuity is a sufficient condition for Riemann
integrability, but is not a necessary one. A function f(x) can have discontinuities
in [a,b] and still remain Riemann integrable on [a,b]. For example,
consider the function

$$f(x) = \begin{cases} -1, & -1 \le x < 0, \\ \phantom{-}1, & 0 \le x \le 1. \end{cases}$$

This function is discontinuous at x = 0. However, it is Riemann integrable on
[-1,1]. To show this, let $\varepsilon > 0$ be given, and let $P = \{x_0, x_1, \ldots, x_n\}$ be a
partition of [-1,1] such that $\Delta_P < \varepsilon/2$. By the nature of this function,
$f(x_i) - f(x_{i-1}) \ge 0$, and the infimum and supremum of f(x) on $[x_{i-1}, x_i]$ are
equal to $f(x_{i-1})$ and $f(x_i)$, respectively, i = 1,2,...,n. Hence,

$$\begin{aligned}
US_P(f) - LS_P(f) &= \sum_{i=1}^{n} (M_i - m_i)\,\Delta x_i \\
&= \sum_{i=1}^{n} [f(x_i) - f(x_{i-1})]\,\Delta x_i \\
&< \frac{\varepsilon}{2} \sum_{i=1}^{n} [f(x_i) - f(x_{i-1})] \\
&= \frac{\varepsilon}{2}[f(1) - f(-1)] = \varepsilon.
\end{aligned}$$

The function f(x) is therefore Riemann integrable on [-1,1] by Theorem
6.2.1.
On the basis of this example it is now easy to prove the following theorem: 


Theorem 6.3.2. If f(x) is monotone increasing (or monotone decreasing) 
on [a, b], then it is Riemann integrable there. 


Theorem 6.3.2 can be used to construct a function that has a countable 
number of discontinuities in [a,b] and is also Riemann integrable (see 
Exercise 6.2). 


6.3.1. Functions of Bounded Variation 


Let f(x) be defined on [a, b]. This function is said to be of bounded variation
on [a,b] if there exists a number M > 0 such that for any partition
$P = \{x_0, x_1, \ldots, x_n\}$ of [a, b] we have

$$\sum_{i=1}^{n} |\Delta f_i| \le M,$$

where $\Delta f_i = f(x_i) - f(x_{i-1})$, i = 1,2,...,n.

Any function that is monotone increasing (or decreasing) on [a, b] is also
of bounded variation there. To show this, let f(x) be monotone increasing on
[a,b]. Then

$$\sum_{i=1}^{n} |\Delta f_i| = \sum_{i=1}^{n} [f(x_i) - f(x_{i-1})] = f(b) - f(a).$$

Hence, if M is any number greater than or equal to f(b) - f(a), then
$\sum_{i=1}^{n} |\Delta f_i| \le M$ for any partition P of [a, b].

Another example of a function of bounded variation is given in the next 
theorem. 


Theorem 6.3.3. If f(x) is continuous on [a,b] and its derivative f'(x) 
exists and is bounded on (a, b), then f(x) is of bounded variation on [a, b]. 


Proof. Let $P = \{x_0, x_1, \ldots, x_n\}$ be a partition of [a,b]. By applying the
mean value theorem (Theorem 4.2.2) on each $[x_{i-1}, x_i]$, i = 1,2,...,n, we
obtain

$$\sum_{i=1}^{n} |\Delta f_i| = \sum_{i=1}^{n} |f'(\xi_i)\,\Delta x_i| \le K \sum_{i=1}^{n} (x_i - x_{i-1}) = K(b - a),$$

where $x_{i-1} < \xi_i < x_i$, i = 1,2,...,n, and K > 0 is such that $|f'(x)| \le K$ on
(a, b). □


It should be noted that any function of bounded variation on [a, b] is also
bounded there. This is true because if a < x < b, then P = {a, x, b} is a
partition of [a, b]. Hence,

$$|f(x) - f(a)| + |f(b) - f(x)| \le M$$

for some positive number M. This implies that |f(x)| is bounded on [a, b]
since

$$|f(x)| \le \tfrac{1}{2}\bigl[\,|f(x) - f(a)| + |f(x) - f(b)| + |f(a) + f(b)|\,\bigr] \le \tfrac{1}{2}\bigl[M + |f(a) + f(b)|\bigr].$$

The converse of this result, however, is not necessarily true, that is, if f(x) is
bounded, then it may not be of bounded variation. For example, the function

$$f(x) = \begin{cases} x\cos\left(\dfrac{\pi}{2x}\right), & 0 < x \le 1, \\ 0, & x = 0 \end{cases}$$

is bounded on [0,1], but is not of bounded variation there. It can be shown
that for the partition

$$P = \left\{0, \frac{1}{2n}, \frac{1}{2n-1}, \ldots, \frac{1}{3}, \frac{1}{2}, 1\right\},$$

$\sum_{i=1}^{2n} |\Delta f_i| \to \infty$ as $n \to \infty$ and hence cannot be bounded by a constant M for
all n (see Exercise 6.4).
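
The unbounded growth of the variation sums for this function can be observed numerically. The sketch below evaluates $\sum |\Delta f_i|$ over the partition {0, 1/(2n), ..., 1/2, 1}; since f vanishes at the points 1/k with k odd and has absolute value 1/k at those with k even, the sum works out to the harmonic partial sum 1 + 1/2 + ... + 1/n, which the code prints alongside for comparison. The function names are illustrative.

```python
import math

def f(x):
    """f(x) = x*cos(pi/(2x)) for 0 < x <= 1, and f(0) = 0."""
    return x * math.cos(math.pi / (2.0 * x)) if x > 0 else 0.0

def variation(n):
    """Sum of |f(x_i) - f(x_{i-1})| over the partition {0, 1/(2n), ..., 1/2, 1}."""
    points = [0.0] + [1.0 / k for k in range(2 * n, 0, -1)]   # increasing order
    return sum(abs(f(b) - f(a)) for a, b in zip(points, points[1:]))

for n in (10, 100, 1000):
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    print(n, variation(n), harmonic)
```

The two printed columns should agree to within rounding error, and both grow roughly like log n.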


Theorem 6.3.4. If f(x) is of bounded variation on [a,b], then it is 
Riemann integrable there. 


Proof. Let $\varepsilon > 0$ be given, and let $P = \{x_0, x_1, \ldots, x_n\}$ be a partition of
[a, b]. Then

$$US_P(f) - LS_P(f) = \sum_{i=1}^{n} (M_i - m_i)\,\Delta x_i, \qquad (6.9)$$

where $m_i$ and $M_i$ are the infimum and supremum of f(x) on $[x_{i-1}, x_i]$,
respectively, i = 1,2,...,n. By the properties of $m_i$ and $M_i$, there exist $\xi_i$
and $\eta_i$ in $[x_{i-1}, x_i]$ such that for i = 1,2,...,n,

$$m_i \le f(\xi_i) < m_i + \varepsilon',$$
$$M_i - \varepsilon' < f(\eta_i) \le M_i,$$

where $\varepsilon'$ is a small positive number to be determined later. It follows that

$$M_i - m_i - 2\varepsilon' < f(\eta_i) - f(\xi_i) \le M_i - m_i, \qquad i = 1,2,\ldots,n.$$

Hence,

$$M_i - m_i < 2\varepsilon' + f(\eta_i) - f(\xi_i) \le 2\varepsilon' + |f(\xi_i) - f(\eta_i)|, \qquad i = 1,2,\ldots,n.$$

From formula (6.9) we obtain

$$US_P(f) - LS_P(f) < 2\varepsilon' \sum_{i=1}^{n} \Delta x_i + \sum_{i=1}^{n} |f(\xi_i) - f(\eta_i)|\,\Delta x_i. \qquad (6.10)$$

Now, if $\Delta_P$ is the norm of P, then

$$\sum_{i=1}^{n} |f(\xi_i) - f(\eta_i)|\,\Delta x_i \le \Delta_P \sum_{i=1}^{n} |f(\xi_i) - f(\eta_i)| \le \Delta_P \sum_{i=1}^{m} |f(z_i) - f(z_{i-1})|, \qquad (6.11)$$

where $\{z_0, z_1, \ldots, z_m\}$ is a partition Q of [a,b], which consists of the points
$x_0, x_1, \ldots, x_n$ as well as the points $\xi_1, \eta_1, \xi_2, \eta_2, \ldots, \xi_n, \eta_n$, that is, Q is a
refinement of P obtained by adding the $\xi_i$'s and $\eta_i$'s (i = 1,2,...,n). Since
f(x) is of bounded variation on [a, b], there exists a number M > 0 such that

$$\sum_{i=1}^{m} |f(z_i) - f(z_{i-1})| \le M. \qquad (6.12)$$

From (6.10), (6.11), and (6.12) it follows that

$$US_P(f) - LS_P(f) < 2\varepsilon'(b - a) + M\Delta_P. \qquad (6.13)$$

Let us now select the partition P such that $\Delta_P < \delta$, where $M\delta < \varepsilon/2$. If we
also choose $\varepsilon'$ such that $2\varepsilon'(b - a) < \varepsilon/2$, then from (6.13) we obtain
$US_P(f) - LS_P(f) < \varepsilon$. The function f(x) is therefore Riemann integrable on
[a, b] by Theorem 6.2.1. □


6.4. PROPERTIES OF THE RIEMANN INTEGRAL


The Riemann integral has several properties that are useful at both the
theoretical and practical levels. Most of these properties are fairly simple and
straightforward. We shall therefore not prove every one of them in this
section.

Theorem 6.4.1. If f(x) and g(x) are Riemann integrable on [a, b] and if
$c_1$ and $c_2$ are constants, then $c_1 f(x) + c_2 g(x)$ is Riemann integrable on
[a,b], and

$$\int_a^b [c_1 f(x) + c_2 g(x)]\,dx = c_1 \int_a^b f(x)\,dx + c_2 \int_a^b g(x)\,dx.$$


Theorem 6.4.2. If f(x) is Riemann integrable on [a, b], and $m \le f(x) \le M$
for all x in [a, b], then

$$m(b - a) \le \int_a^b f(x)\,dx \le M(b - a).$$


Theorem 6.4.3. If f(x) and g(x) are Riemann integrable on [a, b], and if
$f(x) \le g(x)$ for all x in [a, b], then

$$\int_a^b f(x)\,dx \le \int_a^b g(x)\,dx.$$


Theorem 6.4.4. If f(x) is Riemann integrable on [a,b] and if a < c < b,
then

$$\int_a^b f(x)\,dx = \int_a^c f(x)\,dx + \int_c^b f(x)\,dx.$$


Theorem 6.4.5. If f(x) is Riemann integrable on [a, b], then so is |f(x)|
and

$$\left| \int_a^b f(x)\,dx \right| \le \int_a^b |f(x)|\,dx.$$


Proof. Let $P = \{x_0, x_1, \ldots, x_n\}$ be a partition of [a,b]. Let $m_i$ and $M_i$ be
the infimum and supremum of f(x), respectively, on $[x_{i-1}, x_i]$; and let $m_i'$, $M_i'$
be the same for |f(x)|. We claim that

$$M_i - m_i \ge M_i' - m_i', \qquad i = 1,2,\ldots,n.$$

It is obvious that $M_i - m_i = M_i' - m_i'$ if f(x) is either nonnegative or nonpositive
for all x in $[x_{i-1}, x_i]$, i = 1,2,...,n. Let us therefore suppose that f(x) is
negative on $D_i^-$ and nonnegative on $D_i^+$, where $D_i^-$ and $D_i^+$ are such that
$D_i^- \cup D_i^+ = D_i = [x_{i-1}, x_i]$ for i = 1,2,...,n. We then have

$$\begin{aligned}
M_i - m_i &= \sup_{D_i^+} f(x) - \inf_{D_i^-} f(x) \\
&= \sup_{D_i^+} |f(x)| + \sup_{D_i^-} |f(x)| \\
&\ge \sup_{D_i} |f(x)| = M_i',
\end{aligned}$$

since $\sup_{D_i} |f(x)| = \max\{\sup_{D_i^+} |f(x)|, \sup_{D_i^-} |f(x)|\}$. Hence, $M_i - m_i \ge M_i' \ge
M_i' - m_i'$ for i = 1,2,...,n, which proves our claim. Consequently,

$$US_P(|f|) - LS_P(|f|) = \sum_{i=1}^{n} (M_i' - m_i')\,\Delta x_i \le \sum_{i=1}^{n} (M_i - m_i)\,\Delta x_i.$$

Hence,

$$US_P(|f|) - LS_P(|f|) \le US_P(f) - LS_P(f). \qquad (6.14)$$

Since f(x) is Riemann integrable, the right-hand side of inequality (6.14) can
be made smaller than any given $\varepsilon > 0$ by a proper choice of the norm $\Delta_P$ of
P. It follows that |f(x)| is Riemann integrable on [a,b] by Theorem 6.2.1.
Furthermore, since $\pm f(x) \le |f(x)|$ for all x in [a,b], then $\int_a^b \pm f(x)\,dx \le
\int_a^b |f(x)|\,dx$ by Theorem 6.4.3, that is,

$$\pm \int_a^b f(x)\,dx \le \int_a^b |f(x)|\,dx.$$

Thus, $\left| \int_a^b f(x)\,dx \right| \le \int_a^b |f(x)|\,dx$. □

Corollary 6.4.1. If f(x) is Riemann integrable on [a, b], then so is f?(x). 


Proof. Using the same notation as in the proof of Theorem 6.4.5, we have
that $m_i'^2$ and $M_i'^2$ are, respectively, the infimum and supremum of $f^2(x)$ on
$[x_{i-1}, x_i]$ for i = 1,2,...,n. Now,

$$\begin{aligned}
M_i'^2 - m_i'^2 &= (M_i' - m_i')(M_i' + m_i') \\
&\le 2M'(M_i' - m_i') \\
&\le 2M'(M_i - m_i), \qquad i = 1,2,\ldots,n, \qquad (6.15)
\end{aligned}$$

where M' is the supremum of |f(x)| on [a, b]. The Riemann integrability of
$f^2(x)$ now follows from inequality (6.15) by the Riemann integrability of f(x). □


Corollary 6.4.2. If f(x) and g(x) are Riemann integrable on [a, b], then 
so is their product f(x)g(x). 


Proof. This follows directly from the identity

$$4f(x)g(x) = [f(x) + g(x)]^2 - [f(x) - g(x)]^2, \qquad (6.16)$$

and the fact that the squares of f(x) + g(x) and f(x) - g(x) are Riemann
integrable on [a, b] by Theorem 6.4.1 and Corollary 6.4.1. □


Theorem 6.4.6 (The Mean Value Theorem for Integrals). If f(x) is
continuous on [a, b], then there exists a point $c \in [a, b]$ such that

$$\int_a^b f(x)\,dx = (b - a) f(c).$$


Proof. By Theorem 6.4.2 we have

$$m \le \frac{1}{b - a} \int_a^b f(x)\,dx \le M,$$

where m and M are, respectively, the infimum and supremum of f(x) on
[a,b]. Since f(x) is continuous, then by Corollary 3.4.1 it must attain the
values m and M at some points inside [a, b]. Furthermore, by the
intermediate-value theorem (Theorem 3.4.4), f(x) assumes every value between m and
M. Hence, there is a point $c \in [a, b]$ such that

$$f(c) = \frac{1}{b - a} \int_a^b f(x)\,dx. \qquad \square$$


Definition 6.4.1. Let f(x) be Riemann integrable on [a, b]. The function

$$F(x) = \int_a^x f(t)\,dt, \qquad a \le x \le b,$$

is called an indefinite integral of f(x). □


Theorem 6.4.7. If f(x) is Riemann integrable on [a,b], then $F(x) =
\int_a^x f(t)\,dt$ is uniformly continuous on [a, b].


Proof. Let $x_1, x_2$ be in [a,b], $x_1 < x_2$. Then,

$$\begin{aligned}
|F(x_2) - F(x_1)| &= \left| \int_a^{x_2} f(t)\,dt - \int_a^{x_1} f(t)\,dt \right| \\
&= \left| \int_{x_1}^{x_2} f(t)\,dt \right|, \quad \text{by Theorem 6.4.4} \\
&\le \int_{x_1}^{x_2} |f(t)|\,dt, \quad \text{by Theorem 6.4.5} \\
&\le M'(x_2 - x_1),
\end{aligned}$$

where M' is the supremum of |f(x)| on [a,b]. Thus if $\varepsilon > 0$ is given, then
$|F(x_2) - F(x_1)| < \varepsilon$ provided that $|x_1 - x_2| < \varepsilon/M'$. This proves uniform
continuity of F(x) on [a, b]. □


The next theorem presents a practical way for evaluating the Riemann 
integral on [a, b]. 


Theorem 6.4.8. Suppose that f(x) is continuous on [a,b]. Let $F(x) =
\int_a^x f(t)\,dt$. Then we have the following:


i. $dF(x)/dx = f(x)$, $a \le x \le b$.


ii. $\int_a^b f(x)\,dx = G(b) - G(a)$, where G(x) = F(x) + c, and c is an arbitrary
constant.


Proof. We have

$$\begin{aligned}
\frac{dF(x)}{dx} &= \lim_{h \to 0} \frac{1}{h} \left[ \int_a^{x+h} f(t)\,dt - \int_a^x f(t)\,dt \right] \\
&= \lim_{h \to 0} \frac{1}{h} \int_x^{x+h} f(t)\,dt, \quad \text{by Theorem 6.4.4} \\
&= \lim_{h \to 0} f(x + \theta h),
\end{aligned}$$

by Theorem 6.4.6, where $0 \le \theta \le 1$. Hence,

$$\frac{dF(x)}{dx} = \lim_{h \to 0} f(x + \theta h) = f(x)$$

by the continuity of f(x). This result indicates that an indefinite integral of
f(x) is any function whose derivative is equal to f(x). It is therefore unique
up to a constant. Thus both F(x) and F(x) + c, where c is an arbitrary
constant, are considered to be indefinite integrals.

To prove the second part of the theorem, let G(x) be defined on [a, b] as

$$G(x) = F(x) + c = \int_a^x f(t)\,dt + c,$$

that is, G(x) is an indefinite integral of f(x). If x = a, then G(a) = c, since
F(a) = 0. Also, if x = b, then $G(b) = F(b) + c = \int_a^b f(t)\,dt + G(a)$. It follows
that

$$\int_a^b f(t)\,dt = G(b) - G(a). \qquad \square$$


This result is known as the fundamental theorem of calculus. It is generally 
attributed to Isaac Barrow (1630-1677), who was the first to realize that 
differentiation and integration are inverse operations. One advantage of this 
theorem is that it provides a practical way to evaluate the integral of f(x) on 
[a,b]. 
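
As a small numerical illustration of part (ii), one can compare a Riemann sum with G(b) - G(a) for a function whose indefinite integral is known. The sketch below assumes f(x) = cos x with G(x) = sin x + c on [0, 2]; the midpoint rule and the helper name `riemann_sum` are choices made for this illustration only.

```python
import math

def riemann_sum(f, a, b, n=100_000):
    """Midpoint Riemann sum of f on [a, b] with n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# f(x) = cos(x) has the indefinite integral G(x) = sin(x) + c.
a, b = 0.0, 2.0
approx = riemann_sum(math.cos, a, b)
exact = math.sin(b) - math.sin(a)   # G(b) - G(a); the constant c cancels
print(approx, exact)
```

The two printed values should agree to many decimal places, while the second one requires no summation at all.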


6.4.1. Change of Variables in Riemann Integration 


There are situations in which the variable x in a Riemann integral is a 
function of some other variable, say u. In this case, it may be of interest to 
determine how the integral can be expressed and evaluated under the given 
transformation. One advantage of this change of variable is the possibility of 
simplifying the actual evaluation of the integral, provided that the transfor- 
mation is properly chosen. 


Theorem 6.4.9. Let f(x) be continuous on $[\alpha, \beta]$, and let x = g(u) be a
function whose derivative g'(u) exists and is continuous on [c,d]. Suppose
that the range of g is contained inside $[\alpha, \beta]$. If a, b are points in $[\alpha, \beta]$
such that a = g(c) and b = g(d), then

$$\int_a^b f(x)\,dx = \int_c^d f[g(u)]\,g'(u)\,du.$$


Proof. Let $F(x) = \int_a^x f(t)\,dt$. By Theorem 6.4.8, F'(x) = f(x). Let G(u) be
defined as

$$G(u) = \int_c^u f[g(t)]\,g'(t)\,dt.$$

Since f, g, and g' are continuous, then by Theorem 6.4.8 we have

$$\frac{dG(u)}{du} = f[g(u)]\,g'(u). \qquad (6.17)$$

However, according to the chain rule (Theorem 4.1.3),

$$\frac{dF[g(u)]}{du} = \frac{dF[g(u)]}{dg(u)} \frac{dg(u)}{du} = f[g(u)]\,g'(u). \qquad (6.18)$$

From formulas (6.17) and (6.18) we conclude that

$$G(u) - F[g(u)] = \lambda, \qquad (6.19)$$

where $\lambda$ is a constant. If a and b are points in $[\alpha, \beta]$ such that a = g(c),
b = g(d), then when u = c, we have G(c) = 0 and $\lambda = -F[g(c)] = -F(a) = 0$.
Furthermore, when u = d, $G(d) = \int_c^d f[g(t)]\,g'(t)\,dt$. From (6.19) we then
obtain

$$G(d) = \int_c^d f[g(t)]\,g'(t)\,dt = F[g(d)] + \lambda = F(b) = \int_a^b f(x)\,dx.$$


For example, consider the integral $\int_1^2 (2t^2 - 1)^{1/2}\,t\,dt$. Let $x = 2t^2 - 1$.
Then dx = 4t dt, and by Theorem 6.4.9,

$$\int_1^2 (2t^2 - 1)^{1/2}\,t\,dt = \frac{1}{4} \int_1^7 x^{1/2}\,dx.$$

An indefinite integral of $x^{1/2}$ is given by $\frac{2}{3}x^{3/2}$. Hence,

$$\int_1^2 (2t^2 - 1)^{1/2}\,t\,dt = \frac{1}{4}\left(\frac{2}{3}\right)(7^{3/2} - 1) = \frac{1}{6}(7^{3/2} - 1). \qquad \square$$
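
The change of variable in this example can be sanity-checked by integrating both sides numerically. This is a rough sketch that assumes the limits of integration are t from 1 to 2 (hence x from 1 to 7); the midpoint-rule helper is an illustrative choice.

```python
import math

def midpoint(f, a, b, n=100_000):
    """Midpoint Riemann sum of f on [a, b] with n equal subintervals."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Left side: integral in t of (2t^2 - 1)^(1/2) * t over [1, 2].
lhs = midpoint(lambda t: math.sqrt(2 * t * t - 1) * t, 1.0, 2.0)
# Right side: (1/4) * integral in x of x^(1/2) over [1, 7], with x = 2t^2 - 1.
rhs = 0.25 * midpoint(math.sqrt, 1.0, 7.0)
# Value obtained from the indefinite integral (2/3) x^(3/2).
closed_form = (7 ** 1.5 - 1) / 6
print(lhs, rhs, closed_form)
```

All three numbers should coincide up to the discretization error of the midpoint rule.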


6.5. IMPROPER RIEMANN INTEGRALS 


In our study of the Riemann integral we have only considered integrals of 
functions that are bounded on a finite interval [a,b]. We now extend the 
scope of Riemann integration to include situations where the integrand can 
become unbounded at one or more points inside the range of integration, 
which can also be infinite. In such situations, the Riemann integral is called 
an improper integral. 

There are two kinds of improper integrals. If f(x) is Riemann integrable
on [a,b] for any b > a, then $\int_a^\infty f(x)\,dx$ is called an improper integral of the
first kind, where the range of integration is infinite. If, however, f(x)
becomes infinite at a finite number of points inside the range of integration,
then the integral $\int_a^b f(x)\,dx$ is said to be improper of the second kind.


Definition 6.5.1. Let $F(z) = \int_a^z f(x)\,dx$. Suppose that F(z) exists for any
value of z greater than a. If F(z) has a finite limit L as $z \to \infty$, then the
improper integral $\int_a^\infty f(x)\,dx$ is said to converge to L. In this case, L
represents the Riemann integral of f(x) on $[a, \infty)$ and we write

$$L = \int_a^\infty f(x)\,dx.$$

On the other hand, if $L = \pm\infty$, then the improper integral $\int_a^\infty f(x)\,dx$ is said
to diverge. By the same token, we can define the integral $\int_{-\infty}^a f(x)\,dx$ as the
limit, if it exists, of $\int_{-z}^a f(x)\,dx$ as $z \to \infty$. Also, $\int_{-\infty}^\infty f(x)\,dx$ is defined as

$$\int_{-\infty}^{\infty} f(x)\,dx = \lim_{u \to \infty} \int_a^u f(x)\,dx + \lim_{z \to \infty} \int_{-z}^a f(x)\,dx,$$

where a is any finite number, provided that both limits exist.

The convergence of $\int_a^\infty f(x)\,dx$ can be determined by using the Cauchy
criterion in a manner similar to the one used in the study of convergence of
sequences (see Section 5.1.1).


Theorem 6.5.1. The improper integral $\int_a^\infty f(x)\,dx$ converges if and only if
for a given $\varepsilon > 0$ there exists a $z_0$ such that

$$\left| \int_{z_1}^{z_2} f(x)\,dx \right| < \varepsilon, \qquad (6.20)$$

whenever $z_1$ and $z_2$ exceed $z_0$.


Proof. If $F(z) = \int_a^z f(x)\,dx$ has a limit L as $z \to \infty$, then for a given $\varepsilon > 0$
there exists $z_0$ such that for $z > z_0$,

$$|F(z) - L| < \frac{\varepsilon}{2}.$$

Now, if both $z_1$ and $z_2$ exceed $z_0$, then

$$\left| \int_{z_1}^{z_2} f(x)\,dx \right| = |F(z_2) - F(z_1)| \le |F(z_2) - L| + |F(z_1) - L| < \varepsilon.$$

Vice versa, if condition (6.20) is satisfied, then we need to show that F(z) has
a limit as $z \to \infty$. Let us therefore define the sequence $\{g_n\}_{n=1}^\infty$, where $g_n$ is
given by

$$g_n = \int_a^{a+n} f(x)\,dx, \qquad n = 1,2,\ldots.$$

It follows that for any $\varepsilon > 0$,

$$|g_n - g_m| = \left| \int_{a+m}^{a+n} f(x)\,dx \right| < \varepsilon$$

if m and n are large enough. This implies that $\{g_n\}_{n=1}^\infty$ is a Cauchy sequence;
hence it converges by Theorem 5.1.6. Let $g = \lim_{n \to \infty} g_n$. To show that
$\lim_{z \to \infty} F(z) = g$, let us write

$$|F(z) - g| = |F(z) - g_n + g_n - g| \le |F(z) - g_n| + |g_n - g|. \qquad (6.21)$$

Suppose $\varepsilon > 0$ is given. There exists an integer $N_1$ such that $|g_n - g| < \varepsilon/2$ if
$n > N_1$. Also, there exists an integer $N_2$ such that

$$|F(z) - g_n| = \left| \int_{a+n}^{z} f(x)\,dx \right| < \frac{\varepsilon}{2} \qquad (6.22)$$

if $z > a + n > N_2$. Thus by choosing z > a + n, where $n > \max(N_1, N_2 - a)$,
we get from inequalities (6.21) and (6.22)

$$|F(z) - g| < \varepsilon.$$

This completes the proof. □


Definition 6.5.2. If the improper integral $\int_a^\infty |f(x)|\,dx$ is convergent, then
the integral $\int_a^\infty f(x)\,dx$ is said to be absolutely convergent. If $\int_a^\infty f(x)\,dx$ is
convergent but not absolutely, then it is said to be conditionally convergent. □


It is easy to show that an improper integral is convergent if it converges 
absolutely. 

As with the case of series of positive terms, there are comparison tests that 
can be used to test for convergence of improper integrals of the first kind of 
nonnegative functions. These tests are described in the following theorems. 


Theorem 6.5.2. Let f(x) be a nonnegative function that is Riemann
integrable on [a, b] for every b > a. Suppose that there exists a function g(x)
such that $f(x) \le g(x)$ for $x \ge a$. If $\int_a^\infty g(x)\,dx$ converges, then so does $\int_a^\infty f(x)\,dx$
and we have

$$\int_a^\infty f(x)\,dx \le \int_a^\infty g(x)\,dx.$$


Proof. See Exercise 6.7. □


Theorem 6.5.3. Let f(x) and g(x) be nonnegative functions that are
Riemann integrable on [a, b] for every b > a. If

$$\lim_{x \to \infty} \frac{f(x)}{g(x)} = k,$$

where k is a positive constant, then $\int_a^\infty f(x)\,dx$ and $\int_a^\infty g(x)\,dx$ are either both
convergent or both divergent.


Proof. See Exercise 6.8. □


EXAMPLE 6.5.1. Consider the integral $\int_1^\infty e^{-x}x^2\,dx$. We have that $e^x = 1 +
\sum_{n=1}^\infty (x^n/n!)$. Hence, for $x \ge 1$, $e^x > x^p/p!$, where p is any positive integer. If
p is chosen such that $p - 2 \ge 2$, then

$$e^{-x}x^2 < \frac{p!}{x^{p-2}} \le \frac{p!}{x^2}.$$

However, $\int_1^\infty (dx/x^2) = [-1/x]_1^\infty = 1$. Therefore, by Theorem 6.5.2, the
integral of $e^{-x}x^2$ on $[1, \infty)$ is convergent.


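EXAMPLE 6.5.2. The integral $\int_0^\infty [(\sin x)/(x + 1)^2]\,dx$ is absolutely
convergent, since

$$\frac{|\sin x|}{(x + 1)^2} \le \frac{1}{(x + 1)^2}$$

and

$$\int_0^\infty \frac{dx}{(x + 1)^2} = \left[ -\frac{1}{x + 1} \right]_0^\infty = 1.$$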


EXAMPLE 6.5.3. The integral $\int_0^\infty (\sin x/x)\,dx$ is conditionally convergent.
We first show that $\int_0^\infty (\sin x/x)\,dx$ is convergent. We have that

$$\int_0^\infty \frac{\sin x}{x}\,dx = \int_0^1 \frac{\sin x}{x}\,dx + \int_1^\infty \frac{\sin x}{x}\,dx. \qquad (6.23)$$

By Exercise 6.3, (sin x)/x is Riemann integrable on [0,1], since it is continuous
there except at x = 0, which is a discontinuity of the first kind (see
Definition 3.4.2). As for the second integral in (6.23), we have for $z_2 > z_1 > 1$,

$$\int_{z_1}^{z_2} \frac{\sin x}{x}\,dx = \left[ -\frac{\cos x}{x} \right]_{z_1}^{z_2} - \int_{z_1}^{z_2} \frac{\cos x}{x^2}\,dx
= \frac{\cos z_1}{z_1} - \frac{\cos z_2}{z_2} - \int_{z_1}^{z_2} \frac{\cos x}{x^2}\,dx.$$

Thus

$$\left| \int_{z_1}^{z_2} \frac{\sin x}{x}\,dx \right| \le \frac{1}{z_1} + \frac{1}{z_2} + \int_{z_1}^{z_2} \frac{dx}{x^2} = \frac{2}{z_1}.$$

Since $2/z_1$ can be made arbitrarily small by choosing $z_1$ large enough, then
by Theorem 6.5.1, $\int_1^\infty (\sin x/x)\,dx$ is convergent and so is $\int_0^\infty (\sin x/x)\,dx$.

It remains to show that $\int_0^\infty (\sin x/x)\,dx$ is not absolutely convergent. This
follows from the fact that (see Exercise 6.10)

$$\lim_{n \to \infty} \int_0^{n\pi} \left| \frac{\sin x}{x} \right| dx = \infty.$$
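
The contrast between convergence and absolute convergence in this example can be seen numerically. The sketch below approximates the integrals of (sin x)/x and |sin x|/x over [0, kπ] for growing k by the midpoint rule; the function names and step counts are illustrative assumptions. The signed integral is known to converge to π/2, while the integral of the absolute value keeps growing.

```python
import math

def sinc(x):
    """(sin x)/x, with the value 1 assigned at x = 0."""
    return math.sin(x) / x if x != 0.0 else 1.0

def midpoint(f, a, b, steps):
    """Midpoint Riemann sum of f on [a, b] with equal subintervals."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def partial_integrals(periods):
    """Integrals of sinc and |sinc| over [0, periods * pi]."""
    upper = periods * math.pi
    steps = 200 * periods                     # 200 subintervals per half-period
    signed = midpoint(sinc, 0.0, upper, steps)
    absolute = midpoint(lambda x: abs(sinc(x)), 0.0, upper, steps)
    return signed, absolute

for periods in (10, 100, 1000):
    signed, absolute = partial_integrals(periods)
    print(periods, signed, absolute)
```

The first printed column settles near π/2 ≈ 1.5708, whereas the second grows roughly like a multiple of the logarithm of the upper limit.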


Convergence of improper integrals of the first kind can be used to 
determine convergence of series of positive terms (see Section 5.2.1). This is 
based on the next theorem. 


Theorem 6.5.4 (Maclaurin’s Integral Test). Let $\sum_{n=1}^\infty a_n$ be a series of
positive terms such that $a_{n+1} \le a_n$ for $n \ge 1$. Let f(x) be a positive nonincreasing
function defined on $[1, \infty)$ such that $f(n) = a_n$, n = 1,2,..., and
$f(x) \to 0$ as $x \to \infty$. Then, $\sum_{n=1}^\infty a_n$ converges if and only if the improper
integral $\int_1^\infty f(x)\,dx$ converges.

Proof. If $n \ge 1$ and $n \le x \le n + 1$, then

$$a_n = f(n) \ge f(x) \ge f(n + 1) = a_{n+1}.$$

By Theorem 6.4.2 we have for $n \ge 1$

$$a_n \ge \int_n^{n+1} f(x)\,dx \ge a_{n+1}. \qquad (6.24)$$

If $s_n = \sum_{i=1}^n a_i$ is the nth partial sum of the series, then from inequality (6.24)
we obtain

$$s_n \ge \int_1^{n+1} f(x)\,dx \ge s_{n+1} - a_1. \qquad (6.25)$$

If the series $\sum_{n=1}^\infty a_n$ converges to the sum s, then $s \ge s_n$ for all n. Consequently,
the sequence whose nth term is $F(n + 1) = \int_1^{n+1} f(x)\,dx$ is monotone
increasing and is bounded by s; hence it must have a limit. Therefore, the
integral $\int_1^\infty f(x)\,dx$ converges.

Now, let us suppose that $\int_1^\infty f(x)\,dx$ is convergent and is equal to L. Then
from inequality (6.25) we obtain

$$s_{n+1} \le a_1 + \int_1^{n+1} f(x)\,dx \le a_1 + L, \qquad n \ge 1, \qquad (6.26)$$

since f(x) is positive. Inequality (6.26) indicates that the monotone increasing
sequence $\{s_n\}_{n=1}^\infty$ is bounded; hence it has a limit, which is the sum of the
series. □


Theorem 6.5.4 provides a test of convergence for a series of positive terms. 
Of course, the usefulness of this test depends on how easy it is to integrate 
the function f(x). 

As an example of using the integral test, consider the harmonic series
$\sum_{n=1}^\infty (1/n)$. If f(x) is defined as f(x) = 1/x, $x \ge 1$, then $F(x) = \int_1^x f(t)\,dt =
\log x$. Since F(x) goes to infinity as $x \to \infty$, the harmonic series must
therefore be divergent, as was shown in Chapter 5. On the other hand, the
series $\sum_{n=1}^\infty (1/n^2)$ is convergent, since $F(x) = \int_1^x (dt/t^2) = 1 - 1/x$, which
converges to 1 as $x \to \infty$.
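
The sandwich inequalities (6.25) and (6.26) behind the integral test can be checked directly for these two series. The following sketch uses the closed forms log(n + 1) and 1 - 1/(n + 1) for the integrals over [1, n + 1]; the helper `partial_sum` and the choice n = 1000 are illustrative.

```python
import math

def partial_sum(term, n):
    """s_n = term(1) + term(2) + ... + term(n)."""
    return sum(term(k) for k in range(1, n + 1))

n = 1000

# Harmonic series: f(x) = 1/x, so the integral over [1, n + 1] is log(n + 1).
s_n = partial_sum(lambda k: 1.0 / k, n)
s_next = s_n + 1.0 / (n + 1)
integral = math.log(n + 1)
print(s_n >= integral >= s_next - 1.0)      # inequality (6.25) with a_1 = 1

# Series of 1/n^2: the integral over [1, infinity) equals L = 1 and a_1 = 1.
t_n = partial_sum(lambda k: 1.0 / k ** 2, n)
print(t_n, t_n <= 1.0 + 1.0)                # inequality (6.26): s_n <= a_1 + L
```

For the harmonic series the lower bound log(n + 1) is unbounded, which forces divergence; for the second series the partial sums stay below 2.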


6.5.1. Improper Riemann Integrals of the Second Kind 


Let us now consider integrals of the form $\int_a^b f(x)\,dx$ where [a,b] is a finite
interval and the integrand becomes infinite at a finite number of points
inside [a, b]. Such integrals are called improper integrals of the second kind.
Suppose, for example, that $f(x) \to \infty$ as $x \to a^+$. Then $\int_a^b f(x)\,dx$ is said to
converge if the limit

$$\lim_{\varepsilon \to 0^+} \int_{a+\varepsilon}^{b} f(x)\,dx$$

exists and is finite. Similarly, if $f(x) \to \infty$ as $x \to b^-$, then $\int_a^b f(x)\,dx$ is
convergent if the limit

$$\lim_{\varepsilon \to 0^+} \int_a^{b-\varepsilon} f(x)\,dx$$

exists. Furthermore, if $f(x) \to \infty$ as $x \to c$, where a < c < b, then $\int_a^b f(x)\,dx$ is
the sum of $\int_a^c f(x)\,dx$ and $\int_c^b f(x)\,dx$ provided that both integrals converge. By
definition, if $f(x) \to \infty$ as $x \to x_0$, where $x_0 \in [a,b]$, then $x_0$ is said to be a
singularity of f(x).

The following theorems can help in determining convergence of integrals
of the second kind. They are similar to Theorems 6.5.1, 6.5.2, and 6.5.3. Their
proofs will therefore be omitted.


Theorem 6.5.5. If $f(x) \to \infty$ as $x \to a^+$, then $\int_a^b f(x)\,dx$ converges if and
only if for a given $\varepsilon > 0$ there exists a $z_0$ such that

$$\left| \int_{z_1}^{z_2} f(x)\,dx \right| < \varepsilon,$$

where $z_1$ and $z_2$ are any two numbers such that $a < z_1 < z_2 < z_0 < b$.


Theorem 6.5.6. Let f(x) be a nonnegative function such that $\int_c^b f(x)\,dx$
exists for every c in (a, b]. If there exists a function g(x) such that $f(x) \le g(x)$
for all x in (a,b], and if $\int_c^b g(x)\,dx$ converges as $c \to a^+$, then so does
$\int_c^b f(x)\,dx$ and we have

$$\int_a^b f(x)\,dx \le \int_a^b g(x)\,dx.$$


Theorem 6.5.7. Let f(x) and g(x) be nonnegative functions that are
Riemann integrable on [c, b] for every c such that $a < c \le b$. If

$$\lim_{x \to a^+} \frac{f(x)}{g(x)} = k,$$

where k is a positive constant, then $\int_a^b f(x)\,dx$ and $\int_a^b g(x)\,dx$ are either both
convergent or both divergent.


Definition 6.5.3. Let $\int_a^b f(x)\,dx$ be an improper integral of the second
kind. If $\int_a^b |f(x)|\,dx$ converges, then $\int_a^b f(x)\,dx$ is said to converge absolutely.
If, however, $\int_a^b f(x)\,dx$ is convergent, but not absolutely, then it is said to be
conditionally convergent. □


Theorem 6.5.8. If $\int_a^b |f(x)|\,dx$ converges, then so does $\int_a^b f(x)\,dx$.


EXAMPLE 6.5.4. Consider the integral $\int_0^1 e^{-x}x^{n-1}\,dx$, where n > 0. If
0 < n < 1, then the integral is improper of the second kind, since $x^{n-1} \to \infty$ as
$x \to 0^+$. Thus, x = 0 is a singularity of the integrand. Since

$$\lim_{x \to 0^+} \frac{e^{-x}x^{n-1}}{x^{n-1}} = 1,$$

then the behavior of $\int_0^1 e^{-x}x^{n-1}\,dx$ with regard to convergence or divergence
is the same as that of $\int_0^1 x^{n-1}\,dx$. But $\int_0^1 x^{n-1}\,dx = (1/n)[x^n]_0^1 = 1/n$ is
convergent, and so is $\int_0^1 e^{-x}x^{n-1}\,dx$.


EXAMPLE 6.5.5. Consider the integral $\int_0^1 (\sin x/x^2)\,dx$. The integrand has a
singularity at x = 0. Let g(x) = 1/x. Then, $(\sin x)/[x^2 g(x)] \to 1$ as $x \to 0^+$. But
$\int_0^1 (dx/x) = [\log x]_0^1$ is divergent, since $\log x \to -\infty$ as $x \to 0^+$. Therefore,
$\int_0^1 (\sin x/x^2)\,dx$ is divergent.


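EXAMPLE 6.5.6. Consider the integral $\int_0^2 (x^2 - 3x + 1)/[x(x - 1)^2]\,dx$.
Here, the integrand has two singularities, namely x = 0 and x = 1, inside
[0,2]. We can therefore write

$$\int_0^2 \frac{x^2 - 3x + 1}{x(x - 1)^2}\,dx = \lim_{t \to 0^+} \int_t^{1/2} \frac{x^2 - 3x + 1}{x(x - 1)^2}\,dx
+ \lim_{u \to 1^-} \int_{1/2}^{u} \frac{x^2 - 3x + 1}{x(x - 1)^2}\,dx
+ \lim_{v \to 1^+} \int_v^{2} \frac{x^2 - 3x + 1}{x(x - 1)^2}\,dx.$$

We note that

$$\frac{x^2 - 3x + 1}{x(x - 1)^2} = \frac{1}{x} - \frac{1}{(x - 1)^2}.$$

Hence,

$$\int_0^2 \frac{x^2 - 3x + 1}{x(x - 1)^2}\,dx = \lim_{t \to 0^+} \left[ \log x + \frac{1}{x - 1} \right]_t^{1/2}
+ \lim_{u \to 1^-} \left[ \log x + \frac{1}{x - 1} \right]_{1/2}^{u}
+ \lim_{v \to 1^+} \left[ \log x + \frac{1}{x - 1} \right]_v^{2}.$$

None of the above limits exists as a finite number. This integral is therefore
divergent.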
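
The partial-fraction decomposition and the divergence near x = 0 can both be checked with a few evaluations. This is an illustrative sketch only: `integrand` and `antiderivative` are names introduced here, and the antiderivative log|x| + 1/(x - 1) is the one obtained from the decomposition 1/x - 1/(x - 1)^2.

```python
import math

def integrand(x):
    return (x * x - 3 * x + 1) / (x * (x - 1) ** 2)

def antiderivative(x):
    """log|x| + 1/(x - 1), from the partial fractions 1/x - 1/(x - 1)^2."""
    return math.log(abs(x)) + 1.0 / (x - 1.0)

# The partial-fraction identity at a few points away from the singularities.
for x in (0.25, 0.5, 1.5, 2.0):
    print(x, integrand(x), 1 / x - 1 / (x - 1) ** 2)

# Integral over [t, 1/2] as t -> 0+: the values grow without bound.
for t in (1e-2, 1e-4, 1e-8):
    print(t, antiderivative(0.5) - antiderivative(t))
```

The first loop should print matching pairs, and the second shows the integral over [t, 1/2] increasing as t shrinks, in line with the first limit failing to be finite.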


6.6. CONVERGENCE OF A SEQUENCE OF RIEMANN INTEGRALS 


In the present section we confine our attention to the limiting behavior of
integrals of a sequence of functions $\{f_n(x)\}_{n=1}^\infty$.


Theorem 6.6.1. Suppose that $f_n(x)$ is Riemann integrable on [a, b] for
$n \ge 1$. If $f_n(x)$ converges uniformly to f(x) on [a,b] as $n \to \infty$, then f(x) is
Riemann integrable on [a, b] and

$$\lim_{n \to \infty} \int_a^b f_n(x)\,dx = \int_a^b f(x)\,dx.$$


Proof. Let us first show that f(x) is Riemann integrable on [a,b]. Let
$\varepsilon > 0$ be given. Since $f_n(x)$ converges uniformly to f(x), then there exists an
integer $n_0$ that depends only on $\varepsilon$ such that

$$|f_n(x) - f(x)| < \frac{\varepsilon}{3(b - a)} \qquad (6.27)$$

if $n > n_0$ for all $x \in [a, b]$. Let $n_1 > n_0$. Since $f_{n_1}(x)$ is Riemann integrable on
[a, b], then by Theorem 6.2.1 there exists a $\delta > 0$ such that

$$US_P(f_{n_1}) - LS_P(f_{n_1}) < \frac{\varepsilon}{3} \qquad (6.28)$$

for any partition P of [a, b] with a norm $\Delta_P < \delta$. Now, from inequality (6.27)
we have

$$f(x) < f_{n_1}(x) + \frac{\varepsilon}{3(b - a)},$$

$$f(x) > f_{n_1}(x) - \frac{\varepsilon}{3(b - a)}.$$

We conclude that

$$US_P(f) \le US_P(f_{n_1}) + \frac{\varepsilon}{3}, \qquad (6.29)$$

$$LS_P(f) \ge LS_P(f_{n_1}) - \frac{\varepsilon}{3}. \qquad (6.30)$$

From inequalities (6.28), (6.29), and (6.30) it follows that if $\Delta_P < \delta$, then

$$US_P(f) - LS_P(f) \le US_P(f_{n_1}) - LS_P(f_{n_1}) + \frac{2\varepsilon}{3} < \varepsilon. \qquad (6.31)$$

Inequality (6.31) shows that f(x) is Riemann integrable on [a,b], again by
Theorem 6.2.1.

Let us now show that

$$\lim_{n \to \infty} \int_a^b f_n(x)\,dx = \int_a^b f(x)\,dx. \qquad (6.32)$$

From inequality (6.27) we have for $n > n_0$,

$$\left| \int_a^b f_n(x)\,dx - \int_a^b f(x)\,dx \right| \le \int_a^b |f_n(x) - f(x)|\,dx < \frac{\varepsilon}{3},$$

and the result follows, since $\varepsilon$ is an arbitrary positive number. □
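
A concrete sequence makes the theorem easy to observe. The sketch below assumes the illustrative choice $f_n(x) = \sqrt{x^2 + 1/n}$ on [-1, 1], which converges uniformly to |x| with $\sup |f_n(x) - |x|| = 1/\sqrt{n}$ (attained at x = 0), so the integrals must approach $\int_{-1}^{1} |x|\,dx = 1$ with an error of at most $2/\sqrt{n}$. Neither the sequence nor the helper names come from the text.

```python
import math

def midpoint(f, a, b, steps=20_000):
    """Midpoint Riemann sum of f on [a, b] with equal subintervals."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def f_n(n):
    """Return the function x -> sqrt(x^2 + 1/n), which tends uniformly to |x|."""
    return lambda x: math.sqrt(x * x + 1.0 / n)

limit_integral = 1.0   # integral of |x| over [-1, 1]
for n in (1, 10, 100, 10_000):
    value = midpoint(f_n(n), -1.0, 1.0)
    # The error is bounded by (b - a) * sup|f_n - f| = 2 / sqrt(n).
    print(n, value, abs(value - limit_integral), 2.0 / math.sqrt(n))
```

In each printed row the third number (the actual error) should stay below the fourth (the bound), and both shrink as n grows.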


6.7. SOME FUNDAMENTAL INEQUALITIES 


In this section we consider certain well-known inequalities for the Riemann 
integral. 


6.7.1. The Cauchy-Schwarz Inequality 


Theorem 6.7.1. Suppose that f(x) and g(x) are such that $f^2(x)$ and
$g^2(x)$ are Riemann integrable on [a, b]. Then

$$\left[ \int_a^b f(x)g(x)\,dx \right]^2 \le \left[ \int_a^b f^2(x)\,dx \right] \left[ \int_a^b g^2(x)\,dx \right]. \qquad (6.33)$$

The limits of integration may be finite or infinite.


Proof. Let $c_1$ and $c_2$ be constants, not both zero. Without loss of generality,
let us assume that $c_2 \ne 0$. Then

$$\int_a^b [c_1 f(x) + c_2 g(x)]^2\,dx \ge 0.$$

Thus the quadratic form

$$c_1^2 \int_a^b f^2(x)\,dx + 2c_1 c_2 \int_a^b f(x)g(x)\,dx + c_2^2 \int_a^b g^2(x)\,dx$$

is nonnegative for all $c_1$ and $c_2$. It follows that its discriminant, namely,

$$4\left[ \int_a^b f(x)g(x)\,dx \right]^2 - 4\left[ \int_a^b f^2(x)\,dx \right] \left[ \int_a^b g^2(x)\,dx \right],$$

must be nonpositive, that is,

$$\left[ \int_a^b f(x)g(x)\,dx \right]^2 \le \left[ \int_a^b f^2(x)\,dx \right] \left[ \int_a^b g^2(x)\,dx \right].$$

It is easy to see that if f(x) and g(x) are linearly related [that is, there exist
constants $\tau_1$ and $\tau_2$, not both zero, such that $\tau_1 f(x) + \tau_2 g(x) = 0$], then
inequality (6.33) becomes an equality. □


6.7.2. Hölder’s Inequality 


This is a generalization of the Cauchy-Schwarz inequality due to Otto
Hölder (1859-1937). To prove Hölder’s inequality we need the following
lemmas:


Lemma 6.7.1. Let $a_1, a_2, \ldots, a_n$; $\lambda_1, \lambda_2, \ldots, \lambda_n$ be nonnegative numbers
such that $\sum_{i=1}^n \lambda_i = 1$. Then

$$\prod_{i=1}^{n} a_i^{\lambda_i} \le \sum_{i=1}^{n} \lambda_i a_i. \qquad (6.34)$$

The right-hand side of inequality (6.34) is a weighted arithmetic mean of
the $a_i$'s, and the left-hand side is a weighted geometric mean.


Proof. This lemma is an extension of a result given in Section 3.7 concerning
the properties of convex functions (see Exercise 6.19). □


Lemma 6.7.2. Suppose that $f_1(x), f_2(x), \ldots, f_n(x)$ are nonnegative and
Riemann integrable on [a, b]. If $\lambda_1, \lambda_2, \ldots, \lambda_n$ are nonnegative numbers such
that $\sum_{i=1}^n \lambda_i = 1$, then

$$\int_a^b \left[ \prod_{i=1}^{n} f_i^{\lambda_i}(x) \right] dx \le \prod_{i=1}^{n} \left[ \int_a^b f_i(x)\,dx \right]^{\lambda_i}. \qquad (6.35)$$


Proof. Without loss of generality, let us assume that $\int_a^b f_i(x)\,dx > 0$ for
i = 1,2,...,n [inequality (6.35) is obviously true if at least one $f_i(x)$ is
identically equal to zero]. By Lemma 6.7.1 we have

$$\frac{\int_a^b \left[ \prod_{i=1}^{n} f_i^{\lambda_i}(x) \right] dx}{\prod_{i=1}^{n} \left[ \int_a^b f_i(x)\,dx \right]^{\lambda_i}}
= \int_a^b \prod_{i=1}^{n} \left[ \frac{f_i(x)}{\int_a^b f_i(x)\,dx} \right]^{\lambda_i} dx
\le \int_a^b \sum_{i=1}^{n} \frac{\lambda_i f_i(x)}{\int_a^b f_i(x)\,dx}\,dx = \sum_{i=1}^{n} \lambda_i = 1.$$

Hence, inequality (6.35) follows. □


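Theorem 6.7.2 (Hölder's Inequality). Let $p$ and $q$ be two positive numbers
such that $1/p + 1/q = 1$. If $|f(x)|^p$ and $|g(x)|^q$ are Riemann integrable
on $[a, b]$, then

\[ \left|\int_a^b f(x)g(x)\,dx\right| \le \left[\int_a^b |f(x)|^p\,dx\right]^{1/p} \left[\int_a^b |g(x)|^q\,dx\right]^{1/q}. \]


Proof. Define the functions

\[ u(x) = |f(x)|^p, \qquad v(x) = |g(x)|^q. \]

Then, by Lemma 6.7.2,

\[ \int_a^b u^{1/p}(x)\,v^{1/q}(x)\,dx \le \left[\int_a^b u(x)\,dx\right]^{1/p} \left[\int_a^b v(x)\,dx\right]^{1/q}, \]

that is,

\[ \int_a^b |f(x)|\,|g(x)|\,dx \le \left[\int_a^b |f(x)|^p\,dx\right]^{1/p} \left[\int_a^b |g(x)|^q\,dx\right]^{1/q}. \tag{6.36} \]

The theorem follows from inequality (6.36) and the fact that

\[ \left|\int_a^b f(x)g(x)\,dx\right| \le \int_a^b |f(x)|\,|g(x)|\,dx. \]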

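We note that the Cauchy-Schwarz inequality can be deduced from Theorem
6.7.2 by taking $p = q = 2$. □


6.7.3. Minkowski's Inequality

The following inequality is due to Hermann Minkowski (1864-1909).


Theorem 6.7.3. Suppose that $f(x)$ and $g(x)$ are functions such that
$|f(x)|^p$ and $|g(x)|^p$ are Riemann integrable on $[a, b]$, where $1 \le p < \infty$. Then

\[ \left[\int_a^b |f(x) + g(x)|^p\,dx\right]^{1/p} \le \left[\int_a^b |f(x)|^p\,dx\right]^{1/p} + \left[\int_a^b |g(x)|^p\,dx\right]^{1/p}. \]


Proof. The theorem is obviously true if $p = 1$ by the triangle inequality.
We therefore assume that $p > 1$. Let $q$ be a positive number such that
$1/p + 1/q = 1$. Hence, $p = p(1/p + 1/q) = 1 + p/q$. Let us now write

\[ |f(x) + g(x)|^p = |f(x) + g(x)|\,|f(x) + g(x)|^{p/q}
 \le |f(x)|\,|f(x) + g(x)|^{p/q} + |g(x)|\,|f(x) + g(x)|^{p/q}. \tag{6.37} \]

By applying Hölder's inequality to the two terms on the right-hand side of
inequality (6.37) we obtain

\[ \int_a^b |f(x)|\,|f(x) + g(x)|^{p/q}\,dx \le \left[\int_a^b |f(x)|^p\,dx\right]^{1/p} \left[\int_a^b |f(x) + g(x)|^p\,dx\right]^{1/q}, \tag{6.38} \]

\[ \int_a^b |g(x)|\,|f(x) + g(x)|^{p/q}\,dx \le \left[\int_a^b |g(x)|^p\,dx\right]^{1/p} \left[\int_a^b |f(x) + g(x)|^p\,dx\right]^{1/q}. \tag{6.39} \]

From inequalities (6.37), (6.38), and (6.39) we conclude that

\[ \int_a^b |f(x) + g(x)|^p\,dx \le \left[\int_a^b |f(x) + g(x)|^p\,dx\right]^{1/q} \left\{ \left[\int_a^b |f(x)|^p\,dx\right]^{1/p} + \left[\int_a^b |g(x)|^p\,dx\right]^{1/p} \right\}. \tag{6.40} \]

Since $1 - 1/q = 1/p$, inequality (6.40) can be written as (the case
$\int_a^b |f(x) + g(x)|^p\,dx = 0$ being trivial)

\[ \left[\int_a^b |f(x) + g(x)|^p\,dx\right]^{1/p} \le \left[\int_a^b |f(x)|^p\,dx\right]^{1/p} + \left[\int_a^b |g(x)|^p\,dx\right]^{1/p}. \quad □ \]


Minkowski's inequality can be extended to integrals involving more than
two functions. It can be shown (see Exercise 6.20) that if $|f_i(x)|^p$ is Riemann
integrable on $[a, b]$ for $i = 1, 2, \ldots, n$, then

\[ \left[\int_a^b \left|\sum_{i=1}^n f_i(x)\right|^p dx\right]^{1/p} \le \sum_{i=1}^n \left[\int_a^b |f_i(x)|^p\,dx\right]^{1/p}. \]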
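A quick numerical sanity check of Hölder's and Minkowski's inequalities can be sketched as follows. The functions $f$ and $g$, the interval, the exponent $p$, and the helper names `integrate` and `p_norm` are illustrative assumptions, not part of the text; the integrals are approximated by a composite midpoint rule.

```python
import math

def integrate(func, a, b, n=20000):
    """Composite midpoint rule approximating the Riemann integral on [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

def p_norm(func, a, b, p):
    """Approximates [ integral of |func|^p over [a, b] ]^(1/p)."""
    return integrate(lambda x: abs(func(x)) ** p, a, b) ** (1.0 / p)

# Illustrative choices (assumptions, not from the text).
f = math.sin
g = lambda x: x * x - 0.5
a, b, p = 0.0, 2.0, 3.0
q = p / (p - 1.0)  # conjugate exponent, so that 1/p + 1/q = 1

# Hoelder: |int f g| <= ||f||_p * ||g||_q
holder_lhs = abs(integrate(lambda x: f(x) * g(x), a, b))
holder_rhs = p_norm(f, a, b, p) * p_norm(g, a, b, q)

# Minkowski: ||f + g||_p <= ||f||_p + ||g||_p
minkowski_lhs = p_norm(lambda x: f(x) + g(x), a, b, p)
minkowski_rhs = p_norm(f, a, b, p) + p_norm(g, a, b, p)

print(holder_lhs, "<=", holder_rhs)
print(minkowski_lhs, "<=", minkowski_rhs)
```

Because the midpoint sums are themselves finite weighted sums, they satisfy the discrete versions of both inequalities, so the comparison is not sensitive to the quadrature error.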


6.7.4. Jensen’s Inequality 


Theorem 6.7.4. Let $X$ be a random variable with a finite expected value,
$\mu = E(X)$. If $\phi(x)$ is a twice differentiable convex function, then

\[ E[\phi(X)] \ge \phi[E(X)]. \]


Proof. Since $\phi(x)$ is convex and $\phi''(x)$ exists, then we must have $\phi''(x) \ge 0$.
By applying the mean value theorem (Theorem 4.2.2) around $\mu$ we obtain

\[ \phi(X) = \phi(\mu) + (X - \mu)\phi'(c), \]

where $c$ is between $\mu$ and $X$. If $X - \mu > 0$, then $c > \mu$ and hence
$\phi'(c) \ge \phi'(\mu)$, since $\phi''(x)$ is nonnegative. Thus,

\[ \phi(X) - \phi(\mu) = (X - \mu)\phi'(c) \ge (X - \mu)\phi'(\mu). \tag{6.41} \]

On the other hand, if $X - \mu < 0$, then $c < \mu$ and $\phi'(c) \le \phi'(\mu)$. Hence,

\[ \phi(X) - \phi(\mu) = (X - \mu)\phi'(c) \ge (X - \mu)\phi'(\mu). \tag{6.42} \]

From inequalities (6.41) and (6.42) we conclude that

\[ E[\phi(X) - \phi(\mu)] \ge \phi'(\mu)E(X - \mu) = 0, \]

which implies that

\[ E[\phi(X)] \ge \phi(\mu), \]

since $E[\phi(\mu)] = \phi(\mu)$. □


6.8. RIEMANN-STIELTJES INTEGRAL


In this section we consider a more general integral, namely the Riemann- 
Stieltjes integral. The concept on which this integral is based can be at- 
tributed to a combination of ideas by Georg Friedrich Riemann (1826-1866) 
and the Dutch mathematician Thomas Joannes Stieltjes (1856-1894). 

The Riemann-Stieltjes integral involves two functions $f(x)$ and $g(x)$, both
defined on the interval $[a, b]$, and is denoted by $\int_a^b f(x)\,dg(x)$. In particular, if
$g(x) = x$ we obtain the Riemann integral $\int_a^b f(x)\,dx$. Thus the Riemann
integral is a special case of the Riemann-Stieltjes integral.

The definition of the Riemann-Stieltjes integral of $f(x)$ with respect to
$g(x)$ on $[a, b]$ is similar to that of the Riemann integral. If $f(x)$ is bounded on
$[a, b]$, if $g(x)$ is monotone increasing on $[a, b]$, and if $P = \{x_0, x_1, \ldots, x_n\}$ is a
partition of $[a, b]$, then as in Section 6.2, we define the sums

\[ LS_P(f, g) = \sum_{i=1}^n m_i \,\Delta g_i, \]

\[ US_P(f, g) = \sum_{i=1}^n M_i \,\Delta g_i, \]

where $m_i$ and $M_i$ are, respectively, the infimum and supremum of $f(x)$ on
$[x_{i-1}, x_i]$, and $\Delta g_i = g(x_i) - g(x_{i-1})$, $i = 1, 2, \ldots, n$. If for a given $\epsilon > 0$ there
exists a $\delta > 0$ such that

\[ US_P(f, g) - LS_P(f, g) < \epsilon \tag{6.43} \]

whenever $\Delta_P < \delta$, where $\Delta_P$ is the norm of $P$, then $f(x)$ is said to be
Riemann-Stieltjes integrable with respect to $g(x)$ on $[a, b]$. In this case,

\[ \int_a^b f(x)\,dg(x) = \inf_P US_P(f, g) = \sup_P LS_P(f, g). \]

Condition (6.43) is both necessary and sufficient for the existence of the
Riemann-Stieltjes integral.

Equivalently, suppose that for a given partition $P = \{x_0, x_1, \ldots, x_n\}$ we
define the sum

\[ S(P, f, g) = \sum_{i=1}^n f(t_i)\,\Delta g_i, \tag{6.44} \]

where $t_i$ is a point in the interval $[x_{i-1}, x_i]$, $i = 1, 2, \ldots, n$. Then $f(x)$ is
Riemann-Stieltjes integrable with respect to $g(x)$ on $[a, b]$ if for any $\epsilon > 0$
there exists a $\delta > 0$ such that

\[ \left| S(P, f, g) - \int_a^b f(x)\,dg(x) \right| < \epsilon \tag{6.45} \]

for any partition $P$ of $[a, b]$ with a norm $\Delta_P < \delta$, and for any choice of the
point $t_i$ in $[x_{i-1}, x_i]$, $i = 1, 2, \ldots, n$.

Theorems concerning the Riemann-Stieltjes integral are very similar to 
those seen earlier concerning the Riemann integral. In particular, we have 
the following theorems: 


Theorem 6.8.1. If f(x) is continuous on [a,b], then f(x) is 
Riemann-Stieltjes integrable on [a, b]. 


Proof. See Exercise 6.21. □


Theorem 6.8.2. If f(x) is monotone increasing (or monotone decreasing) 
on [a,b], and g(x) is continuous on [a,b], then f(x) is Riemann-Stieltjes 
integrable with respect to g(x) on [a, b]. 


Proof. See Exercise 6.22. □


The next theorem shows that under certain conditions, the Riemann- 
Stieltjes integral reduces to the Riemann integral. 


Theorem 6.8.3. Suppose that $f(x)$ is Riemann-Stieltjes integrable with
respect to $g(x)$ on $[a, b]$, where $g(x)$ has a continuous derivative $g'(x)$ on
$[a, b]$. Then

\[ \int_a^b f(x)\,dg(x) = \int_a^b f(x)g'(x)\,dx. \]


Proof. Let $P = \{x_0, x_1, \ldots, x_n\}$ be a partition of $[a, b]$. Consider the sum

\[ S(P, h) = \sum_{i=1}^n h(t_i)\,\Delta x_i, \tag{6.46} \]

where $h(x) = f(x)g'(x)$ and $x_{i-1} \le t_i \le x_i$, $i = 1, 2, \ldots, n$. Let us also consider
the sum

\[ S(P, f, g) = \sum_{i=1}^n f(t_i)\,\Delta g_i. \tag{6.47} \]

If we apply the mean value theorem (Theorem 4.2.2) to $g(x)$, we obtain

\[ \Delta g_i = g(x_i) - g(x_{i-1}) = g'(z_i)\,\Delta x_i, \qquad i = 1, 2, \ldots, n, \tag{6.48} \]

where $x_{i-1} < z_i < x_i$, $i = 1, 2, \ldots, n$. From (6.46), (6.47), and (6.48) we can
then write

\[ S(P, f, g) - S(P, h) = \sum_{i=1}^n f(t_i)\,[g'(z_i) - g'(t_i)]\,\Delta x_i. \tag{6.49} \]

Since $f(x)$ is bounded on $[a, b]$ and $g'(x)$ is uniformly continuous on $[a, b]$ by
Theorem 3.4.6, then for a given $\epsilon > 0$ there exists a $\delta_1 > 0$, which depends
only on $\epsilon$, such that

\[ |g'(z_i) - g'(t_i)| < \frac{\epsilon}{2M(b - a)} \]

if $|z_i - t_i| < \delta_1$, where $M > 0$ is such that $|f(x)| \le M$ on $[a, b]$. From (6.49) it
follows that if the partition $P$ has a norm $\Delta_P < \delta_1$, then

\[ |S(P, f, g) - S(P, h)| < \frac{\epsilon}{2}. \tag{6.50} \]

Now, since $f(x)$ is Riemann-Stieltjes integrable with respect to $g(x)$ on
$[a, b]$, then by definition, for the given $\epsilon > 0$ there exists a $\delta_2 > 0$ such that

\[ \left| S(P, f, g) - \int_a^b f(x)\,dg(x) \right| < \frac{\epsilon}{2} \tag{6.51} \]

if the norm $\Delta_P$ of $P$ is less than $\delta_2$. We conclude from (6.50) and (6.51) that
if the norm of $P$ is less than $\min(\delta_1, \delta_2)$, then

\[ \left| S(P, h) - \int_a^b f(x)\,dg(x) \right| < \epsilon. \]

Since $\epsilon$ is arbitrary, this inequality implies that $\int_a^b f(x)\,dg(x)$ is in fact the
Riemann integral $\int_a^b h(x)\,dx = \int_a^b f(x)g'(x)\,dx$. □


Using Theorem 6.8.3, it is easy to see that if, for example, $f(x) = 1$ and
$g(x) = x^2$, then $\int_a^b f(x)\,dg(x) = \int_a^b f(x)g'(x)\,dx = \int_a^b 2x\,dx = b^2 - a^2$.

It should be noted that Theorems 6.8.1 and 6.8.2 provide sufficient
conditions for the existence of $\int_a^b f(x)\,dg(x)$. It is possible, however, for the
Riemann-Stieltjes integral to exist even if $g(x)$ is a discontinuous function.
For example, consider the function $g(x) = \gamma I(x - c)$, where $\gamma$ is a nonzero
constant, $a < c \le b$, and $I(x - c)$ is such that

\[ I(x - c) = \begin{cases} 0, & x < c, \\ 1, & x \ge c. \end{cases} \]

The quantity $\gamma$ represents what is called a jump at $x = c$. If $f(x)$ is bounded
on $[a, b]$ and is continuous at $x = c$, then

\[ \int_a^b f(x)\,dg(x) = \gamma f(c). \tag{6.52} \]

To show the validity of formula (6.52), let $P = \{x_0, x_1, \ldots, x_n\}$ be any partition
of $[a, b]$. Then, $\Delta g_i = g(x_i) - g(x_{i-1})$ will be zero as long as $x_i < c$ or
$x_{i-1} \ge c$. Suppose, therefore, that there exists a $k$, $1 \le k \le n$, such that
$x_{k-1} < c \le x_k$. In this case, the sum $S(P, f, g)$ in formula (6.44) takes the
form

\[ S(P, f, g) = \sum_{i=1}^n f(t_i)\,\Delta g_i = \gamma f(t_k). \]

It follows that

\[ |S(P, f, g) - \gamma f(c)| = |\gamma|\,|f(t_k) - f(c)|. \tag{6.53} \]

Now, let $\epsilon > 0$ be given. Since $f(x)$ is continuous at $x = c$, then there exists a
$\delta > 0$ such that

\[ |f(t_k) - f(c)| < \frac{\epsilon}{|\gamma|}, \]

if $|t_k - c| < \delta$. Thus if the norm $\Delta_P$ of $P$ is chosen so that $\Delta_P < \delta$, then

\[ |S(P, f, g) - \gamma f(c)| < \epsilon. \tag{6.54} \]

Equality (6.52) follows from comparing inequalities (6.45) and (6.54).
It is now easy to show that if

\[ g(x) = \begin{cases} \lambda, & a \le x < b, \\ \lambda', & x = b, \end{cases} \]

and if $f(x)$ is continuous at $x = b$, then

\[ \int_a^b f(x)\,dg(x) = (\lambda' - \lambda) f(b). \tag{6.55} \]

The previous examples represent special cases of a class of functions
defined on $[a, b]$ called step functions. These functions are constant on $[a, b]$
except for a finite number of jump discontinuities. We can generalize
formula (6.55) to this class of functions as can be seen in the next theorem.


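Theorem 6.8.4. Let $g(x)$ be a step function defined on $[a, b]$ with jump
discontinuities at $x = c_1, c_2, \ldots, c_n$, and $a < c_1 < c_2 < \cdots < c_n = b$, such that

\[ g(x) = \begin{cases}
\lambda_1, & a \le x < c_1, \\
\lambda_2, & c_1 \le x < c_2, \\
\ \vdots & \\
\lambda_n, & c_{n-1} \le x < c_n, \\
\lambda_{n+1}, & x = c_n.
\end{cases} \]

If $f(x)$ is bounded on $[a, b]$ and is continuous at $x = c_1, c_2, \ldots, c_n$, then

\[ \int_a^b f(x)\,dg(x) = \sum_{i=1}^n (\lambda_{i+1} - \lambda_i) f(c_i). \tag{6.56} \]


Proof. The proof can be easily obtained by first writing the integral in
formula (6.56) as

\[ \int_a^b f(x)\,dg(x) = \int_a^{c_1} f(x)\,dg(x) + \int_{c_1}^{c_2} f(x)\,dg(x) + \cdots + \int_{c_{n-1}}^{c_n} f(x)\,dg(x). \tag{6.57} \]

If we now apply formula (6.55) to each integral in (6.57) we obtain

\[ \int_a^{c_1} f(x)\,dg(x) = (\lambda_2 - \lambda_1) f(c_1), \]

\[ \int_{c_1}^{c_2} f(x)\,dg(x) = (\lambda_3 - \lambda_2) f(c_2), \]

\[ \vdots \]

\[ \int_{c_{n-1}}^{c_n} f(x)\,dg(x) = (\lambda_{n+1} - \lambda_n) f(c_n). \]

By adding up all these integrals we obtain formula (6.56). □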


EXAMPLE 6.8.1. One example of a step function is the greatest-integer
function $[x]$, which is defined as the greatest integer less than or equal to $x$.
If $f(x)$ is bounded on $[0, n]$ and is continuous at $x = 1, 2, \ldots, n$, where $n$ is a
positive integer, then by Theorem 6.8.4 we can write

\[ \int_0^n f(x)\,d[x] = \sum_{i=1}^n f(i). \tag{6.58} \]

It follows that every finite sum of the form $\sum_{i=1}^n a_i$ can be expressed as a
Riemann-Stieltjes integral with respect to $[x]$ of a function $f(x)$ continuous
on $[0, n]$ such that $f(i) = a_i$, $i = 1, 2, \ldots, n$. The Riemann-Stieltjes integral
has therefore the distinct advantage of making finite sums expressible as
integrals.
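The sum $S(P, f, g)$ of formula (6.44) is easy to evaluate directly, which gives a concrete check of formula (6.58) and of the example $g(x) = x^2$ following Theorem 6.8.3. The sketch below is illustrative only: the function name `rs_sum`, the uniform partition, and the choice of $t_i$ as the right endpoint are assumptions made for the example.

```python
import math

def rs_sum(f, g, a, b, n):
    """Riemann-Stieltjes sum S(P, f, g) of (6.44) on a uniform partition of
    [a, b] with n subintervals, taking t_i to be the right endpoint x_i."""
    h = (b - a) / n
    total = 0.0
    for i in range(1, n + 1):
        x_prev = a + (i - 1) * h
        x_i = a + i * h
        total += f(x_i) * (g(x_i) - g(x_prev))
    return total

# Step integrator g(x) = [x] on [0, 4] with f(x) = x^2:
# formula (6.58) gives 1 + 4 + 9 + 16 = 30.
# n = 4096 makes the mesh a power of two, so partition points are exact floats.
step_value = rs_sum(lambda x: x * x, math.floor, 0.0, 4.0, 4096)

# Smooth integrator g(x) = x^2 on [1, 3] with f(x) = 1:
# the value is b^2 - a^2 = 8.
smooth_value = rs_sum(lambda x: 1.0, lambda x: x * x, 1.0, 3.0, 1024)

print(step_value, smooth_value)
```

With the step integrator, only the subintervals that contain an integer contribute, which is exactly the mechanism used in the proof of formula (6.52).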


6.9. APPLICATIONS IN STATISTICS 


Riemann integration plays an important role in statistical distribution theory. 
Perhaps the most prevalent use of the Riemann integral is in the study of the 
distributions of continuous random variables. 

We recall from Section 4.5.1 that a continuous random variable $X$ with a
cumulative distribution function $F(x) = P(X \le x)$ is absolutely continuous if
$F(x)$ is differentiable. In this case, there exists a function $f(x)$ called the
density function of $X$ such that $F'(x) = f(x)$, that is,

\[ F(x) = \int_{-\infty}^x f(t)\,dt. \tag{6.59} \]

In general, if $X$ is a continuous random variable, it need not be absolutely
continuous. It is true, however, that most common distributions that are
continuous are also absolutely continuous.

The probability distribution of an absolutely continuous random variable is
completely determined by its density function. For example, from (6.59) it
follows that

\[ P(a < X < b) = F(b) - F(a) = \int_a^b f(x)\,dx. \tag{6.60} \]

Note that the value of this probability remains unchanged if one or both of
the end points of the interval $[a, b]$ are included. This is true because the
probability assigned to these individual points is zero when $X$ has a continuous
distribution. The mean $\mu$ and variance $\sigma^2$ of $X$ are given by

\[ \mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx, \]

\[ \sigma^2 = \mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx. \]

In general, the $k$th central moment of $X$, denoted by $\mu_k$ $(k = 1, 2, \ldots)$, is
defined as

\[ \mu_k = E[(X - \mu)^k] = \int_{-\infty}^{\infty} (x - \mu)^k f(x)\,dx. \tag{6.61} \]

We note that $\sigma^2 = \mu_2$. Similarly, the $k$th noncentral moment of $X$, denoted
by $\mu_k'$ $(k = 1, 2, \ldots)$, is defined as

\[ \mu_k' = E(X^k). \tag{6.62} \]

The first noncentral moment of $X$ is its mean $\mu$, while its first central
moment is equal to zero.

We note that if the domain of the density function $f(x)$ is infinite, then $\mu_k$
and $\mu_k'$ are improper integrals of the first kind. Therefore, they may or may
not exist. If $\int_{-\infty}^{\infty} |x|^k f(x)\,dx$ exists, then so does $\mu_k'$ (see Definition 6.5.2
concerning absolute convergence of improper integrals). The latter integral is
called the $k$th absolute moment and is denoted by $\nu_k$. If $\nu_k$ exists, then the
noncentral moments of order $j$ for $j \le k$ exist also. This follows because of
the inequality

\[ |x|^j \le |x|^k + 1 \quad \text{if } j \le k, \tag{6.63} \]

which is true because $|x|^j \le |x|^k$ if $|x| \ge 1$ and $|x|^j \le 1$ if $|x| < 1$. Hence,
$|x^j| \le |x|^k + 1$ for all $x$. Consequently, from (6.63) we obtain the inequality

\[ \nu_j \le \nu_k + 1, \qquad j \le k, \]

which implies that $\mu_j'$ exists for $j \le k$. Since the central moment $\mu_j$ in
formula (6.61) is expressible in terms of noncentral moments of order $j$ or
smaller, the existence of $\nu_j$ also implies the existence of $\mu_j$.


EXAMPLE 6.9.1. Consider a random variable $X$ with the density function

\[ f(x) = \frac{1}{\pi(1 + x^2)}, \qquad -\infty < x < \infty. \]

Such a random variable has the so-called Cauchy distribution. Its mean $\mu$
does not exist. This follows from the fact that in order for $\mu$ to exist, the two
limits in the following formula must exist:

\[ \mu = \frac{1}{\pi} \lim_{a \to \infty} \int_{-a}^0 \frac{x\,dx}{1 + x^2} + \frac{1}{\pi} \lim_{b \to \infty} \int_0^b \frac{x\,dx}{1 + x^2}. \tag{6.64} \]

But, $\int_{-a}^0 x\,dx/(1 + x^2) = -\frac{1}{2}\log(1 + a^2) \to -\infty$ as $a \to \infty$, and
$\int_0^b x\,dx/(1 + x^2) = \frac{1}{2}\log(1 + b^2) \to \infty$ as $b \to \infty$. The integral
$\int_{-\infty}^{\infty} x\,dx/(1 + x^2)$ is therefore divergent, and hence $\mu$ does not exist.

It should be noted that it would be incorrect to state that

\[ \mu = \frac{1}{\pi} \lim_{a \to \infty} \int_{-a}^a \frac{x\,dx}{1 + x^2}, \tag{6.65} \]

which is equal to zero. This is because the limits in (6.64) must exist for any $a$
and $b$ that tend to infinity. The limit in formula (6.65) requires that $a = b$.
Such a limit is therefore considered as a subsequential limit.

The higher-order moments of the Cauchy distribution do not exist either.
It is easy to verify that

\[ \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{x^k\,dx}{1 + x^2} \tag{6.66} \]

is divergent for $k \ge 1$.
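The dependence of the truncated mean on how the two limits are taken can be made concrete. The sketch below evaluates $(1/\pi)\int_{-a}^{b} x\,dx/(1 + x^2) = [\log(1 + b^2) - \log(1 + a^2)]/(2\pi)$ in closed form; the function name `truncated_mean` and the particular choices $b = a$ and $b = 2a$ are illustrative assumptions.

```python
import math

def truncated_mean(a, b):
    """(1/pi) * integral of x / (1 + x^2) over [-a, b], in closed form."""
    return (math.log1p(b * b) - math.log1p(a * a)) / (2.0 * math.pi)

cutoffs = (10.0, 1e3, 1e6)

# Symmetric truncation (b = a), as in formula (6.65): always zero.
symmetric = [truncated_mean(a, a) for a in cutoffs]

# Lopsided truncation (b = 2a): tends to log(4) / (2*pi) = log(2) / pi instead.
lopsided = [truncated_mean(a, 2.0 * a) for a in cutoffs]

print(symmetric)
print(lopsided, math.log(2.0) / math.pi)
```

Two admissible ways of letting the limits go to infinity give two different answers, which is why the integral in (6.64) is divergent even though the symmetric limit in (6.65) is zero.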


EXAMPLE 6.9.2. Consider a random variable $X$ that has the logistic
distribution with the density function

\[ f(x) = \frac{e^x}{(1 + e^x)^2}, \qquad -\infty < x < \infty. \]

The mean of $X$ is

\[ \mu = \int_{-\infty}^{\infty} \frac{x e^x}{(1 + e^x)^2}\,dx = \int_0^1 \log\left[\frac{u}{1 - u}\right] du, \tag{6.67} \]

where $u = e^x/(1 + e^x)$. We recognize the integral in (6.67) as being an
improper integral of the second kind with singularities at $u = 0$ and $u = 1$.
We therefore write

\[ \mu = \lim_{a \to 0^+} \int_a^{1/2} \log\left[\frac{u}{1 - u}\right] du + \lim_{b \to 1^-} \int_{1/2}^b \log\left[\frac{u}{1 - u}\right] du. \tag{6.68} \]

Thus,

\[ \mu = \lim_{a \to 0^+} \big[u \log u + (1 - u)\log(1 - u)\big]_a^{1/2} + \lim_{b \to 1^-} \big[u \log u + (1 - u)\log(1 - u)\big]_{1/2}^b \]

\[ \phantom{\mu} = \lim_{b \to 1^-} [b \log b + (1 - b)\log(1 - b)] - \lim_{a \to 0^+} [a \log a + (1 - a)\log(1 - a)]. \tag{6.69} \]

By applying l'Hospital's rule (Theorem 4.2.6) we find that

\[ \lim_{b \to 1^-} (1 - b)\log(1 - b) = \lim_{a \to 0^+} a \log a = 0. \]

We thus have

\[ \mu = \lim_{b \to 1^-} (b \log b) - \lim_{a \to 0^+} [(1 - a)\log(1 - a)] = 0. \]

The variance $\sigma^2$ of $X$ can be shown to be equal to $\pi^2/3$ (see Exercise 6.24).
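Both values can be checked numerically. The sketch below integrates $x f(x)$ and $x^2 f(x)$ over the truncated range $[-40, 40]$ with a midpoint rule; the truncation point, the number of subintervals, and the helper names are assumptions chosen for illustration. The density is rewritten as $1/[4\cosh^2(x/2)]$, which is algebraically the same as $e^x/(1 + e^x)^2$ and avoids overflow for large $|x|$.

```python
import math

def logistic_density(x):
    """e^x / (1 + e^x)^2, written as 1 / (4 cosh^2(x/2)) to avoid overflow."""
    return 0.25 / math.cosh(0.5 * x) ** 2

def midpoint(func, a, b, n):
    """Composite midpoint rule on [a, b] with n subintervals."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

L, N = 40.0, 80000  # truncation point and mesh size (illustrative choices)

mean = midpoint(lambda x: x * logistic_density(x), -L, L, N)
variance = midpoint(lambda x: x * x * logistic_density(x), -L, L, N)

print(mean, variance, math.pi ** 2 / 3.0)
```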


6.9.1. The Existence of the First Negative Moment 
of a Continuous Distribution 


Let $X$ be a continuous random variable with a density function $f(x)$. By
definition, the first negative moment of $X$ is $E(X^{-1})$. The existence of such
a moment will be explored in this section.

The need to evaluate a first negative moment can arise in many practical
applications. Here are some examples.


EXAMPLE 6.9.3. Let $Z$ be a population with a mean $\mu$ and a variance
$\sigma^2$. The coefficient of variation is a measure of variation in the population
per unit mean and is equal to $\sigma/|\mu|$, assuming that $\mu \neq 0$. An estimate of
this ratio is $s/|\bar{y}|$, where $s$ and $\bar{y}$ are, respectively, the sample standard
deviation and sample mean of a sample randomly chosen from $Z$. If the
population is normally distributed, then $\bar{y}$ is also normally distributed and is
statistically independent of $s$. In this case, $E(s/|\bar{y}|) = E(s)E(1/|\bar{y}|)$. The
question now is whether $E(1/|\bar{y}|)$ exists or not.


EXAMPLE 6.9.4 (Calibration or Inverse Regression). Consider the simple
linear regression model

\[ E(y) = \beta_0 + \beta_1 x. \tag{6.70} \]

In most regression situations, the interest is in predicting the response $y$ for
a given value of $x$. For this purpose we use the prediction equation
$\hat{y} = \hat\beta_0 + \hat\beta_1 x$, where $\hat\beta_0$ and $\hat\beta_1$ are the least-squares estimators of $\beta_0$ and $\beta_1$,
respectively. These are obtained from the data set $\{(x_1, y_1), (x_2, y_2), \ldots,
(x_n, y_n)\}$ that results from running $n$ experiments in which $y$ is measured for
specified settings of $x$. There are other situations, however, where the
interest is in predicting the value of $x$, say $x_0$, that corresponds to an
observed value of $y$, say $y_0$. This is an inverse regression problem known as
the calibration problem (see Graybill, 1976, Section 8.5; Montgomery and
Peck, 1982, Section 9.7).

For example, in calibrating a new type of thermometer, $n$ readings,
$y_1, y_2, \ldots, y_n$, are taken at predetermined known temperature values,
$x_1, x_2, \ldots, x_n$ (these values are known by using a standard temperature
gauge). Suppose that the relationship between the $x_i$'s and the $y_i$'s is well
represented by the model in (6.70). If a new reading $y_0$ is observed using the
new thermometer, then it is of interest to estimate the correct temperature
$x_0$ (that is, the temperature on the standard gauge corresponding to the
observed temperature reading $y_0$).

In another calibration problem, the date of delivery of a pregnant woman
can be estimated by the size $y$ of the head of her unborn child, which can be
determined by a special electronic device (sonogram). If the relationship
between $y$ and the number of days $x$ left until delivery is well represented by
model (6.70), then for a measured value of $y$, say $y_0$, it is possible to estimate
$x_0$, the corresponding value of $x$.

In general, from model (6.70) we have $E(y_0) = \beta_0 + \beta_1 x_0$. If $\beta_1 \neq 0$, we
can solve for $x_0$ and obtain

\[ x_0 = \frac{E(y_0) - \beta_0}{\beta_1}. \]

Hence, to estimate $x_0$ we use

\[ \hat{x}_0 = \frac{y_0 - \hat\beta_0}{\hat\beta_1} = \bar{x} + \frac{y_0 - \bar{y}}{\hat\beta_1}, \]

since $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$, where $\bar{x} = (1/n)\sum_{i=1}^n x_i$ and $\bar{y} = (1/n)\sum_{i=1}^n y_i$. If the response
$y$ is normally distributed with a variance $\sigma^2$, then $\bar{y}$ and $\hat\beta_1$ are statistically
independent. Since $y_0$ is also statistically independent of $\hat\beta_1$ ($y_0$ does not
belong to the data set used to estimate $\beta_1$), then the expected value of $\hat{x}_0$ is
given by

\[ E(\hat{x}_0) = \bar{x} + E\left(\frac{y_0 - \bar{y}}{\hat\beta_1}\right) = \bar{x} + E(y_0 - \bar{y})\,E\left(\frac{1}{\hat\beta_1}\right). \]


Here again it is of interest to know if $E(1/\hat\beta_1)$ exists.

Now, suppose that the density function $f(x)$ of the continuous random
variable $X$ is defined on $(0, \infty)$. Let us also assume that $f(x)$ is continuous.
The expected value of $X^{-1}$ is

\[ E(X^{-1}) = \int_0^\infty \frac{f(x)}{x}\,dx. \tag{6.71} \]

This is an improper integral with a singularity at $x = 0$. In particular, if
$f(0) > 0$, then $E(X^{-1})$ does not exist, because

\[ \lim_{x \to 0^+} \frac{f(x)/x}{1/x} = f(0) > 0. \]

By Theorem 6.5.7, the integrals $\int_0^1 (f(x)/x)\,dx$ and $\int_0^1 (dx/x)$ are of the same
kind. Since the latter is divergent, then so is the former. Note that if $f(x)$ is
defined on $(-\infty, \infty)$ and $f(0) > 0$, then $E(X^{-1})$ does not exist either. In this
case,

\[ \int_{-\infty}^{\infty} \frac{f(x)}{|x|}\,dx = \int_{-\infty}^0 \frac{f(x)}{|x|}\,dx + \int_0^{\infty} \frac{f(x)}{|x|}\,dx. \]

Both integrals on the right-hand side are divergent.


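A sufficient condition for the existence of $E(X^{-1})$ is given by the following
theorem [see Piegorsch and Casella (1985)]:


Theorem 6.9.1. Let $f(x)$ be a continuous density function for a random
variable $X$ defined on $(0, \infty)$. If

\[ \lim_{x \to 0^+} \frac{f(x)}{x^\alpha} < \infty \quad \text{for some } \alpha > 0, \tag{6.72} \]

then $E(X^{-1})$ exists.


Proof. Since the limit of $f(x)/x^\alpha$ is finite as $x \to 0^+$, there exist finite
constants $M$ and $\delta > 0$ such that $f(x)/x^\alpha < M$ if $0 < x < \delta$. Hence,

\[ \int_0^\delta \frac{f(x)}{x}\,dx < M \int_0^\delta x^{\alpha - 1}\,dx = \frac{M\delta^\alpha}{\alpha}. \]

Thus,

\[ E(X^{-1}) = \int_0^\infty \frac{f(x)}{x}\,dx = \int_0^\delta \frac{f(x)}{x}\,dx + \int_\delta^\infty \frac{f(x)}{x}\,dx
 < \frac{M\delta^\alpha}{\alpha} + \frac{1}{\delta}\int_\delta^\infty f(x)\,dx \le \frac{M\delta^\alpha}{\alpha} + \frac{1}{\delta} < \infty. \quad □ \]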


It should be noted that the condition of Theorem 6.9.1 is not a necessary 
one. Piegorsch and Casella (1985) give an example of a family of density 
functions that all violate condition (6.72), with some members having a finite 
first negative moment and others not having one (see Exercise 6.25). 


Corollary 6.9.1. Let $f(x)$ be a continuous density function for a random
variable $X$ defined on $(0, \infty)$ such that $f(0) = 0$. If $f'(0)$ exists and is finite,
then $E(X^{-1})$ exists.


Proof. We have that

\[ \lim_{x \to 0^+} \frac{f(x)}{x} = \lim_{x \to 0^+} \frac{f(x) - f(0)}{x} = f'(0). \]

By applying Theorem 6.9.1 with $\alpha = 1$ we conclude that $E(X^{-1})$ exists. □


EXAMPLE 6.9.5. Let $X$ be a normal random variable with a mean $\mu$ and
a variance $\sigma^2$. Its density function is given by

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right], \qquad -\infty < x < \infty. \]

In this example, $f(0) > 0$. Hence, $E(X^{-1})$ does not exist. Consequently,
$E(1/|\bar{y}|)$ in Example 6.9.3 does not exist if the population $Z$ is normally
distributed, since the density function of $|\bar{y}|$ is positive at zero. Also, in
Example 6.9.4, $E(1/\hat\beta_1)$ does not exist, because $\hat\beta_1$ is normally distributed if
the response $y$ satisfies the assumption of normality.


EXAMPLE 6.9.6. Let $X$ be a continuous random variable with the density
function

\[ f(x) = \frac{1}{\Gamma(n/2)\,2^{n/2}}\, x^{n/2 - 1} e^{-x/2}, \qquad 0 < x < \infty, \]

where $n$ is a positive integer and $\Gamma(n/2)$ is the value of the gamma function,
$\int_0^\infty e^{-x} x^{n/2 - 1}\,dx$. This is the density function of a chi-squared random variable
with $n$ degrees of freedom.

Let us consider the limit

\[ \lim_{x \to 0^+} \frac{f(x)}{x^\alpha} = \frac{1}{\Gamma(n/2)\,2^{n/2}} \lim_{x \to 0^+} x^{n/2 - \alpha - 1} e^{-x/2} \]

for $\alpha > 0$. This limit exists and is equal to zero if $n/2 - \alpha - 1 > 0$, that is, if
$n > 2(1 + \alpha) > 2$. Thus by Theorem 6.9.1, $E(X^{-1})$ exists if the number of
degrees of freedom exceeds 2.

More recently, Khuri and Casella (2002) presented several extensions and
generalizations of the results in Piegorsch and Casella (1985), including a
necessary and sufficient condition for the existence of $E(X^{-1})$.
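For the chi-squared density the first negative moment can in fact be written down when $n > 2$: substituting the density into (6.71) and using the definition of the gamma function gives $E(X^{-1}) = \Gamma(n/2 - 1)/[2\,\Gamma(n/2)]$, which simplifies to $1/(n - 2)$ because $\Gamma(z + 1) = z\,\Gamma(z)$. The sketch below simply evaluates that identity; the function name is a hypothetical helper introduced for illustration.

```python
import math

def chi2_first_negative_moment(n):
    """E(1/X) for a chi-squared variable with n > 2 degrees of freedom,
    evaluated as Gamma(n/2 - 1) / (2 * Gamma(n/2))."""
    if n <= 2:
        # For n = 1 or 2 the integrand f(x)/x is not integrable near x = 0.
        raise ValueError("E(1/X) does not exist for n <= 2")
    return math.gamma(n / 2 - 1) / (2.0 * math.gamma(n / 2))

for n in (3, 4, 5, 10):
    print(n, chi2_first_negative_moment(n), 1.0 / (n - 2))
```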


6.9.2. Transformation of Continuous Random Variables 


Let $Y$ be a continuous random variable with a density function $f(y)$. Let
$W = \psi(Y)$, where $\psi(y)$ is a function whose derivative exists and is continuous
on a set $A$. Suppose that $\psi'(y) \neq 0$ for all $y \in A$, that is, $\psi(y)$ is strictly
monotone. We recall from Section 4.5.1 that the density function $g(w)$ of $W$
is given by

\[ g(w) = f[\psi^{-1}(w)] \left| \frac{d\psi^{-1}(w)}{dw} \right|, \qquad w \in B, \tag{6.73} \]

where $B$ is the image of $A$ under $\psi$ and $y = \psi^{-1}(w)$ is the inverse function
of $w = \psi(y)$. This result can be easily obtained by applying the change of
variables technique in Section 6.4.1. This is done as follows: We have that for
any $w_1$ and $w_2$ such that $w_1 \le w_2$,

\[ P(w_1 \le W \le w_2) = \int_{w_1}^{w_2} g(w)\,dw. \tag{6.74} \]

If $\psi'(y) > 0$, then $\psi(y)$ is strictly monotone increasing. Hence,

\[ P(w_1 \le W \le w_2) = P(y_1 \le Y \le y_2), \tag{6.75} \]

where $y_1$ and $y_2$ are such that $w_1 = \psi(y_1)$, $w_2 = \psi(y_2)$. But

\[ P(y_1 \le Y \le y_2) = \int_{y_1}^{y_2} f(y)\,dy. \tag{6.76} \]

Let us now apply the change of variables $y = \psi^{-1}(w)$ to the integral in (6.76).
By Theorem 6.4.9 we have

\[ \int_{y_1}^{y_2} f(y)\,dy = \int_{w_1}^{w_2} f[\psi^{-1}(w)]\, \frac{d\psi^{-1}(w)}{dw}\,dw. \tag{6.77} \]

From (6.75), (6.76), and (6.77) we obtain

\[ P(w_1 \le W \le w_2) = \int_{w_1}^{w_2} f[\psi^{-1}(w)]\, \frac{d\psi^{-1}(w)}{dw}\,dw. \tag{6.78} \]

On the other hand, if $\psi'(y) < 0$, then $P(w_1 \le W \le w_2) = P(y_2 \le Y \le y_1)$.
Consequently,

\[ P(w_1 \le W \le w_2) = \int_{y_2}^{y_1} f(y)\,dy
 = \int_{w_2}^{w_1} f[\psi^{-1}(w)]\, \frac{d\psi^{-1}(w)}{dw}\,dw
 = -\int_{w_1}^{w_2} f[\psi^{-1}(w)]\, \frac{d\psi^{-1}(w)}{dw}\,dw. \tag{6.79} \]

By combining (6.78) and (6.79) we obtain

\[ P(w_1 \le W \le w_2) = \int_{w_1}^{w_2} f[\psi^{-1}(w)] \left| \frac{d\psi^{-1}(w)}{dw} \right| dw. \tag{6.80} \]

Formula (6.73) now follows from comparing (6.74) and (6.80).


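The Case Where $w = \psi(y)$ Has No Unique Inverse


Formula (6.73) requires that $w = \psi(y)$ has a unique inverse, the existence of
which is guaranteed by the nonvanishing of the derivative $\psi'(y)$. Let us now
consider the following extension: The function $\psi(y)$ is continuously
differentiable, but its derivative can vanish at a finite number of points inside its
domain. We assume that the domain of $\psi(y)$ can be partitioned into a finite
number, say $n$, of disjoint subdomains, denoted by $I_1, I_2, \ldots, I_n$, on each of
which $\psi(y)$ is strictly monotone (decreasing or increasing). Hence, on each $I_i$
$(i = 1, 2, \ldots, n)$, $\psi(y)$ has a unique inverse. Let $\psi_i$ denote the restriction of
the function $\psi$ to $I_i$, that is, $\psi_i(y)$ has a unique inverse, $y = \psi_i^{-1}(w)$,
$i = 1, 2, \ldots, n$. Since $I_1, I_2, \ldots, I_n$ are disjoint, for any $w_1$ and $w_2$ such that
$w_1 \le w_2$ we have

\[ P(w_1 \le W \le w_2) = \sum_{i=1}^n P[Y \in \psi_i^{-1}(w_1, w_2)], \tag{6.81} \]

where $\psi_i^{-1}(w_1, w_2)$ is the inverse image of $[w_1, w_2]$, which is a subset of $I_i$,
$i = 1, 2, \ldots, n$. Now, on the $i$th subdomain we have

\[ P[Y \in \psi_i^{-1}(w_1, w_2)] = \int_{\psi_i^{-1}(w_1, w_2)} f(y)\,dy
 = \int_{T_i} f[\psi_i^{-1}(w)] \left| \frac{d\psi_i^{-1}(w)}{dw} \right| dw, \qquad i = 1, 2, \ldots, n, \tag{6.82} \]

where $T_i$ is the image of $\psi_i^{-1}(w_1, w_2)$ under $\psi_i$. Formula (6.82) follows from
applying formula (6.80) to the function $\psi_i(y)$, $i = 1, 2, \ldots, n$. Note that
$T_i = \psi_i[\psi_i^{-1}(w_1, w_2)] = [w_1, w_2] \cap \psi_i(I_i)$, $i = 1, 2, \ldots, n$ (why?). Thus $T_i$ is a subset
of both $[w_1, w_2]$ and $\psi_i(I_i)$. We can therefore write the integral in (6.82) as

\[ \int_{T_i} f[\psi_i^{-1}(w)] \left| \frac{d\psi_i^{-1}(w)}{dw} \right| dw
 = \int_{w_1}^{w_2} \delta_i(w)\, f[\psi_i^{-1}(w)] \left| \frac{d\psi_i^{-1}(w)}{dw} \right| dw, \tag{6.83} \]

where $\delta_i(w) = 1$ if $w \in \psi_i(I_i)$ and $\delta_i(w) = 0$ otherwise, $i = 1, 2, \ldots, n$. Using
(6.82) and (6.83) in formula (6.81), we obtain

\[ P(w_1 \le W \le w_2) = \sum_{i=1}^n \int_{w_1}^{w_2} \delta_i(w)\, f[\psi_i^{-1}(w)] \left| \frac{d\psi_i^{-1}(w)}{dw} \right| dw
 = \int_{w_1}^{w_2} \sum_{i=1}^n \delta_i(w)\, f[\psi_i^{-1}(w)] \left| \frac{d\psi_i^{-1}(w)}{dw} \right| dw, \]

from which we deduce that the density function of $W$ is given by

\[ g(w) = \sum_{i=1}^n \delta_i(w)\, f[\psi_i^{-1}(w)] \left| \frac{d\psi_i^{-1}(w)}{dw} \right|. \tag{6.84} \]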


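EXAMPLE 6.9.7. Let $Y$ have the standard normal distribution with the
density function

\[ f(y) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right), \qquad -\infty < y < \infty. \]

Define the random variable $W$ as $W = Y^2$. In this case, the function $w = y^2$
has two inverse functions on $(-\infty, \infty)$, namely,

\[ y = \begin{cases} -\sqrt{w}, & y \le 0, \\ \sqrt{w}, & y > 0. \end{cases} \]

Thus $I_1 = (-\infty, 0]$, $I_2 = (0, \infty)$, and $\psi_1(I_1) = [0, \infty)$, $\psi_2(I_2) = (0, \infty)$. Hence,
$\delta_1(w) = 1$, $\delta_2(w) = 1$ if $w \in (0, \infty)$. By applying formula (6.84) we then get

\[ g(w) = \frac{1}{\sqrt{2\pi}}\, e^{-w/2} \left| \frac{-1}{2\sqrt{w}} \right| + \frac{1}{\sqrt{2\pi}}\, e^{-w/2} \left| \frac{1}{2\sqrt{w}} \right|
 = \frac{1}{\sqrt{2\pi}}\, \frac{e^{-w/2}}{\sqrt{w}}, \qquad w > 0. \]

This represents the density function of a chi-squared random variable with
one degree of freedom.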
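The resulting density can be cross-checked against the distribution function of $W$. Since $P(W \le w) = P(-\sqrt{w} \le Y \le \sqrt{w}) = \operatorname{erf}(\sqrt{w/2})$ for a standard normal $Y$, a numerical derivative of this expression should reproduce $g(w)$. The sketch below does this with a central difference; the function names and the step size are illustrative assumptions.

```python
import math

def cdf_w(w):
    """P(Y^2 <= w) = P(-sqrt(w) <= Y <= sqrt(w)) for a standard normal Y."""
    return math.erf(math.sqrt(w / 2.0))

def density_w(w):
    """g(w) obtained from formula (6.84): exp(-w/2) / sqrt(2*pi*w), w > 0."""
    return math.exp(-w / 2.0) / math.sqrt(2.0 * math.pi * w)

def central_difference(func, w, h=1e-5):
    """Symmetric finite-difference approximation of func'(w)."""
    return (func(w + h) - func(w - h)) / (2.0 * h)

for w in (0.5, 1.0, 2.5):
    print(w, central_difference(cdf_w, w), density_w(w))
```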


6.9.3. The Riemann-Stieltjes Integral Representation of the Expected Value 


Let $X$ be a random variable with a cumulative distribution function $F(x)$.
Suppose that $h(x)$ is Riemann-Stieltjes integrable with respect to $F(x)$ on
$(-\infty, \infty)$. Then the expected value of $h(X)$ is defined as

\[ E[h(X)] = \int_{-\infty}^{\infty} h(x)\,dF(x). \tag{6.85} \]

Formula (6.85) provides a unified representation of expected values for both
discrete and continuous random variables.

If $X$ is a continuous random variable with a density function $f(x)$, that is,
$F'(x) = f(x)$, then

\[ E[h(X)] = \int_{-\infty}^{\infty} h(x) f(x)\,dx. \]

If, however, $X$ has a discrete distribution with a probability mass function
$p(x)$ and takes the values $c_1, c_2, \ldots, c_n$, then

\[ E[h(X)] = \sum_{i=1}^n h(c_i)\,p(c_i). \tag{6.86} \]

Formula (6.86) follows from applying Theorem 6.8.4. Here, $F(x)$ is a step
function with jump discontinuities at $c_1, c_2, \ldots, c_n$ such that

\[ F(x) = \begin{cases}
0, & -\infty < x < c_1, \\
p(c_1), & c_1 \le x < c_2, \\
p(c_1) + p(c_2), & c_2 \le x < c_3, \\
\ \vdots & \\
\sum_{i=1}^{n-1} p(c_i), & c_{n-1} \le x < c_n, \\
\sum_{i=1}^{n} p(c_i) = 1, & c_n \le x < \infty.
\end{cases} \]

Thus, by formula (6.56) we obtain

\[ \int_{-\infty}^{\infty} h(x)\,dF(x) = \sum_{i=1}^n p(c_i)\,h(c_i). \tag{6.87} \]

For example, suppose that $X$ has the discrete uniform distribution with
$p(x) = 1/n$ for $x = c_1, c_2, \ldots, c_n$. Its cumulative distribution function $F(x)$
can be expressed as

\[ F(x) = P[X \le x] = \frac{1}{n} \sum_{i=1}^n I(x - c_i), \]

where $I(x - c_i)$ is equal to zero if $x < c_i$ and is equal to one if $x \ge c_i$
$(i = 1, 2, \ldots, n)$. The expected value of $h(X)$ is

\[ E[h(X)] = \frac{1}{n} \sum_{i=1}^n h(c_i). \]


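EXAMPLE 6.9.8. The moment generating function $\phi(t)$ of a random
variable $X$ with a cumulative distribution function $F(x)$ is defined as the
expected value of $h(X) = e^{tX}$, that is,

\[ \phi(t) = E(e^{tX}) = \int_{-\infty}^{\infty} e^{tx}\,dF(x), \]

where $t$ is a scalar. If $X$ is a discrete random variable with a probability mass
function $p(x)$ and takes the values $c_1, c_2, \ldots, c_n, \ldots$, then by letting $n$ go to
infinity in (6.87) we obtain (see also Section 5.6.2)

\[ \phi(t) = \sum_{i=1}^{\infty} p(c_i)\,e^{tc_i}. \]

The moment generating function of a continuous random variable with a
density function $f(x)$ is

\[ \phi(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx. \tag{6.88} \]

The convergence of the integral in (6.88) depends on the choice of the scalar
$t$. For example, for the gamma distribution $G(\alpha, \beta)$ with the density function

\[ f(x) = \frac{x^{\alpha - 1} e^{-x/\beta}}{\Gamma(\alpha)\beta^\alpha}, \qquad \alpha > 0, \quad \beta > 0, \quad 0 < x < \infty, \]

$\phi(t)$ is of the form

\[ \phi(t) = \int_0^\infty \frac{e^{tx} x^{\alpha - 1} e^{-x/\beta}}{\Gamma(\alpha)\beta^\alpha}\,dx
 = \int_0^\infty \frac{x^{\alpha - 1} \exp[-x(1 - \beta t)/\beta]}{\Gamma(\alpha)\beta^\alpha}\,dx. \]

If we set $y = x(1 - \beta t)/\beta$, we obtain

\[ \phi(t) = \int_0^\infty \frac{\beta}{1 - \beta t} \left[\frac{\beta y}{1 - \beta t}\right]^{\alpha - 1} \frac{e^{-y}}{\Gamma(\alpha)\beta^\alpha}\,dy. \]

Thus,

\[ \phi(t) = \frac{1}{(1 - \beta t)^\alpha}\, \frac{1}{\Gamma(\alpha)} \int_0^\infty y^{\alpha - 1} e^{-y}\,dy = (1 - \beta t)^{-\alpha}, \]

since $\int_0^\infty e^{-y} y^{\alpha - 1}\,dy = \Gamma(\alpha)$ by the definition of the gamma function. We note
that $\phi(t)$ exists for all values of $\alpha$ provided that $1 - \beta t > 0$, that is, $t < 1/\beta$.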
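As a rough numerical check of the closed form $(1 - \beta t)^{-\alpha}$, the integral in (6.88) can be approximated directly for a particular gamma distribution. The sketch below uses a midpoint rule on a truncated range; the parameter values, the truncation point, the mesh size, and the function names are all assumptions chosen for illustration, and the code assumes $\alpha \ge 1$ so that the integrand stays bounded near zero.

```python
import math

def gamma_mgf_numeric(t, alpha, beta, upper=200.0, n=200000):
    """Midpoint-rule approximation of the integral in (6.88) for G(alpha, beta).
    Assumes t < 1/beta (so the integrand decays) and alpha >= 1."""
    const = math.gamma(alpha) * beta ** alpha
    h = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += x ** (alpha - 1.0) * math.exp(-x * (1.0 - beta * t) / beta)
    return h * total / const

def gamma_mgf_closed(t, alpha, beta):
    """Closed form (1 - beta*t)^(-alpha), valid for t < 1/beta."""
    return (1.0 - beta * t) ** (-alpha)

# Illustrative parameter values (not from the text); here 1/beta is about 0.667.
t, alpha, beta = 0.2, 2.0, 1.5
numeric = gamma_mgf_numeric(t, alpha, beta)
closed = gamma_mgf_closed(t, alpha, beta)
print(numeric, closed)
```

For $t \ge 1/\beta$ the exponent $-x(1 - \beta t)/\beta$ is no longer negative, the integrand does not decay, and the truncated sums grow without bound as the upper limit increases.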


6.9.4. Chebyshev’s Inequality 


In Section 5.6.1 there was a mention of Chebyshev’s inequality. Using the 
Riemann-Stieltjes integral representation of the expected value, it is now 
possible to provide a proof for this important inequality. 


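Theorem 6.9.2. Let $X$ be a random variable (discrete or continuous) with
a mean $\mu$ and a variance $\sigma^2$. Then, for any positive constant $r$,

\[ P(|X - \mu| \ge r\sigma) \le \frac{1}{r^2}. \]


Proof. By definition, $\sigma^2$ is the expected value of $h(X) = (X - \mu)^2$. Thus,

\[ \sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2\,dF(x), \tag{6.89} \]

where $F(x)$ is the cumulative distribution function of $X$. Let us now partition
$(-\infty, \infty)$ into three disjoint intervals: $(-\infty, \mu - r\sigma]$, $(\mu - r\sigma, \mu + r\sigma)$,
$[\mu + r\sigma, \infty)$. The integral in (6.89) can therefore be written as

\[ \sigma^2 = \int_{-\infty}^{\mu - r\sigma} (x - \mu)^2\,dF(x) + \int_{\mu - r\sigma}^{\mu + r\sigma} (x - \mu)^2\,dF(x) + \int_{\mu + r\sigma}^{\infty} (x - \mu)^2\,dF(x) \]

\[ \phantom{\sigma^2} \ge \int_{-\infty}^{\mu - r\sigma} (x - \mu)^2\,dF(x) + \int_{\mu + r\sigma}^{\infty} (x - \mu)^2\,dF(x). \tag{6.90} \]

We note that in the first integral in (6.90), $x \le \mu - r\sigma$, so that $x - \mu \le -r\sigma$.
Hence, $(x - \mu)^2 \ge r^2\sigma^2$. Also, in the second integral, $x - \mu \ge r\sigma$. Hence,
$(x - \mu)^2 \ge r^2\sigma^2$. Consequently,

\[ \int_{-\infty}^{\mu - r\sigma} (x - \mu)^2\,dF(x) \ge r^2\sigma^2 \int_{-\infty}^{\mu - r\sigma} dF(x) = r^2\sigma^2 P(X \le \mu - r\sigma), \]

\[ \int_{\mu + r\sigma}^{\infty} (x - \mu)^2\,dF(x) \ge r^2\sigma^2 \int_{\mu + r\sigma}^{\infty} dF(x) = r^2\sigma^2 P(X \ge \mu + r\sigma). \]

From inequality (6.90) we then have

\[ \sigma^2 \ge r^2\sigma^2 [P(X - \mu \le -r\sigma) + P(X - \mu \ge r\sigma)] = r^2\sigma^2 P(|X - \mu| \ge r\sigma), \]

which implies that

\[ P(|X - \mu| \ge r\sigma) \le \frac{1}{r^2}. \quad □ \]


EXERCISES

In Mathematics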
6.1. Let f(x) be a bounded function defined on the interval [a, b]. Let P be 
a partition of [a, b]. Show that f(x) is Riemann integrable on [a, b] if 


and only if 


inf USp( f) = sup LSp(f) = f F(x) de. 
P P a 
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6.2. 


6.3. 


6.4. 


6.5. 


6.6. 


INTEGRATION 


Construct a function that has a countable number of discontinuities in 
[0,1] and is Riemann integrable on [0, 1]. 


Show that if f(x) is continuous on [a, b] except for a finite number of 
discontinuities of the first kind (see Definition 3.4.2), then f(x) is 
Riemann integrable on [a, b]. 


Show that the function 


xcos(7/2x), O<x<1, 


rwy A 


is not of bounded variation on [0, 1]. 


Let f(x) and g(x) have continuous derivatives with g'(x) > 0. Suppose 

that lim, „„ f(x) =~, lim, „„ gv) =%, and lim, _,., f(x) /g’(x) =L, 

where L is finite. 

(a) Show that for a given e> 0 there exists a constant M > 0 such that 
for x > M, 


| f'(x) —Lg'(x)|< eg'(x). 


Hence, if A, and A, are such that M <A, < M, then 


JEO -Leld 


Ag 
< eg'(x) dk. 
je) 
(b) Deduce from (a) that 


ae -t]<et f(A) | g(A,) 


7) eye EO). 


(c) Make use of (b) to show that for a sufficiently large A,, 


and hence lim, „„ f(x)/g(x) =L. 
[ Note: This problem verifies l’Hospital’s rule for the © /- indeterminate 
form by using integration properties without relying on Cauchy’s mean 
value theorem as in Section 4.2 (see Hartig, 1991)]. 


Show that if f(x) is continuous on [a, b], and if g(x) is a nonnegative 
Riemann integrable function on [a,b] such that {?9(x)dx>0, then 


EXERCISES 255 


6.7. 


6.8. 


6.9. 


6.12. 


6.13. 


there exists a constant c, a <c <b, such that 


f(x) g(x) de 
Ieg(x) dx 


f(c). 


Prove Theorem 6.5.2. 
Prove Theorem 6.5.3. 


Suppose that f(x) is a positive monotone decreasing function defined 
on [a,%) such that f(x) -0 as x—œ. Show that if f(x) is 
Riemann-Stieltjes integrable with respect to g(x) on [a,b] for every 
b >a, where g(x) is bounded on [a, ©), then the integral [7 f(x) dg(x) is 
convergent. 


. Show that lim, _,., Jọ 7|(sin x)/x|dx = %, where n is a positive integer. 


[ Hint: Show first that 


1 1 
—+ 


fone + |1ųÁÃ—meeeeeeooososo 
x x+T x+(n-1)7 


Te 
dx = f sin x 
0 


dx.] 


. Apply Maclaurin’s integral test to determine convergence or divergence 


of the following series: 


(a) 2 logn 

a a ae, 
n=1 nyn 

i < nt+4 

ip) > | 2n? +1? 

x 1 
c =; 
(o) nay EFASI 


Consider the sequence {f,(x)}*_,, where f,(x) =nx/( + nx’), x> 0. 
Find the limit of f? f(x)dx as n > &. 


6.13. Consider the improper integral $\int_0^1 x^{m-1}(1-x)^{n-1}\,dx$.

(a) Show that the integral converges if m > 0, n > 0. In this case, the
function B(m, n) defined as

\[ B(m, n) = \int_0^1 x^{m-1}(1-x)^{n-1}\,dx \]

is called the beta function.

(b) Show that

\[ B(m, n) = 2\int_0^{\pi/2} \sin^{2m-1}\theta \,\cos^{2n-1}\theta \,d\theta. \]

(c) Show that

\[ B(m, n) = \int_0^{\infty} \frac{x^{m-1}}{(1+x)^{m+n}}\,dx = \int_0^{\infty} \frac{x^{n-1}}{(1+x)^{m+n}}\,dx. \]

(d) Show that

\[ B(m, n) = \int_0^1 \frac{x^{m-1} + x^{n-1}}{(1+x)^{m+n}}\,dx. \]


6.14. Determine whether each of the following integrals is convergent or
divergent:
(a) — 
(a) f EEEn 


6.15. Let $f_1(x)$ and $f_2(x)$ be bounded on [a, b], and g(x) be monotone
increasing on [a, b]. If $f_1(x)$ and $f_2(x)$ are Riemann-Stieltjes integrable
with respect to g(x) on [a, b], then show that $f_1(x) f_2(x)$ is also
Riemann-Stieltjes integrable with respect to g(x) on [a, b].


6.16. Let f(x) be a function whose first n derivatives are continuous on
[a, b], and let

\[ h_n(x) = f(b) - f(x) - (b-x) f'(x) - \cdots - \frac{(b-x)^{n-1}}{(n-1)!} f^{(n-1)}(x). \]

Show that

\[ h_n(a) = \frac{1}{(n-1)!} \int_a^b (b-x)^{n-1} f^{(n)}(x)\,dx \]

and hence

\[ f(b) = f(a) + (b-a) f'(a) + \frac{(b-a)^2}{2!} f''(a) + \cdots + \frac{(b-a)^{n-1}}{(n-1)!} f^{(n-1)}(a) + \frac{1}{(n-1)!} \int_a^b (b-x)^{n-1} f^{(n)}(x)\,dx. \]

This represents Taylor's expansion of f(x) around x = a (see Section
4.3) with a remainder $R_n$ given by

\[ R_n = \frac{1}{(n-1)!} \int_a^b (b-x)^{n-1} f^{(n)}(x)\,dx. \]

[Note: This form of Taylor's theorem has the advantage of providing an
exact formula for $R_n$, which does not involve an undetermined number
$\theta_n$ as was seen in Section 4.3.]


6.17. Suppose that f(x) is monotone and its derivative f'(x) is Riemann
integrable on [a, b]. Let g(x) be continuous on [a, b]. Show that there
exists a number c, a < c < b, such that

\[ \int_a^b f(x) g(x)\,dx = f(a) \int_a^c g(x)\,dx + f(b) \int_c^b g(x)\,dx. \]


6.18. Deduce from Exercise 6.17 that for any b > a > 0,

\[ \left| \int_a^b \frac{\sin x}{x}\,dx \right| \le \frac{4}{a}. \]


6.19. Prove Lemma 6.7.1.


6.20. Show that if $f_1(x), f_2(x), \ldots, f_n(x)$ are such that $|f_i(x)|^p$ is Riemann
integrable on [a, b] for i = 1, 2, ..., n, where $1 \le p < \infty$, then

\[ \left[ \int_a^b \left| \sum_{i=1}^n f_i(x) \right|^p dx \right]^{1/p} \le \sum_{i=1}^n \left[ \int_a^b |f_i(x)|^p\,dx \right]^{1/p}. \]


6.21. Prove Theorem 6.8.1. 


6.22. Prove Theorem 6.8.2. 


In Statistics 


6.23. Show that the integral in formula (6.66) is divergent for k > 1. 


6.24. Consider the random variable X that has the logistic distribution
described in Example 6.9.2. Show that $\mathrm{Var}(X) = \pi^2/3$.


6.25. Let $\{f_n(x)\}_{n=1}^{\infty}$ be a family of density functions defined by

\[ f_n(x) = \frac{|\log^n x|^{-1}}{\int_0^{\delta} |\log^n t|^{-1}\,dt}, \qquad 0 < x < \delta, \]

where $\delta \in (0, 1)$.

(a) Show that condition (6.72) of Theorem 6.9.1 is not satisfied by any
$f_n(x)$, $n \ge 1$.

(b) Show that when n = 1, $E(X_1^{-1})$ does not exist, where $X_1$ is a
random variable with the density function $f_1(x)$.

(c) Show that for n > 1, $E(X_n^{-1})$ exists, where $X_n$ is a random variable
with the density function $f_n(x)$.


6.26. Let X be a random variable with a continuous density function f(x) on
$(0, \infty)$. Suppose that f(x) is bounded near zero. Then $E(X^{-\alpha})$ exists,
where $\alpha \in (0, 1)$.


6.27. Let X be a random variable with a continuous density function f(x) on
$(0, \infty)$. If $\lim_{x\to 0^+} (f(x)/x^{\alpha})$ is equal to a positive constant k for some
$\alpha > 0$, then $E[X^{-(1+\alpha)}]$ does not exist.


6.28. The random variable Y has the t-distribution with n degrees of
freedom. Its density function is given by

\[ f(y) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{y^2}{n}\right)^{-(n+1)/2}, \qquad -\infty < y < \infty, \]

where $\Gamma(m)$ is the gamma function $\int_0^{\infty} e^{-x} x^{m-1}\,dx$, m > 0. Find the
density function of W = |Y|.


6.29. Let X be a random variable with a mean $\mu$ and a variance $\sigma^2$.

(a) Show that Chebyshev's inequality can be expressed as

\[ P(|X - \mu| \ge r) \le \frac{\sigma^2}{r^2}, \]

where r is any positive constant.

(b) Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of independent and identically
distributed random variables. If the common mean and variance of
the $X_i$'s are $\mu$ and $\sigma^2$, respectively, then show that

\[ P(|\bar{X}_n - \mu| \ge r) \le \frac{\sigma^2}{n r^2}, \]

where $\bar{X}_n = (1/n)\sum_{i=1}^n X_i$ and r is any positive constant.

(c) Deduce from (b) that $\bar{X}_n$ converges in probability to $\mu$ as $n \to \infty$,
that is, for every $\epsilon > 0$,

\[ P(|\bar{X}_n - \mu| \ge \epsilon) \to 0 \quad \text{as } n \to \infty. \]


6.30. Let X be a random variable with a cumulative distribution function
F(x). Let $\mu_k$ be its kth noncentral moment,

\[ \mu_k = E(X^k) = \int_{-\infty}^{\infty} x^k\,dF(x). \]

Let $\nu_k$ be the kth absolute moment of X,

\[ \nu_k = E(|X|^k) = \int_{-\infty}^{\infty} |x|^k\,dF(x). \]

Suppose that $\nu_k$ exists for k = 1, 2, ..., n.

(a) Show that $\nu_k^2 \le \nu_{k-1}\nu_{k+1}$, k = 1, 2, ..., n - 1. [Hint: For any u
and v,

\[ 0 \le \int_{-\infty}^{\infty} \left[ u|x|^{(k-1)/2} + v|x|^{(k+1)/2} \right]^2 dF(x) = u^2\nu_{k-1} + 2uv\,\nu_k + v^2\nu_{k+1}. ] \]

(b) Deduce from (a) that

\[ \nu_1 \le \nu_2^{1/2} \le \nu_3^{1/3} \le \cdots \le \nu_n^{1/n}. \]


CHAPTER 7 


Multidimensional Calculus 


In the previous chapters we have mainly dealt with real-valued functions of a
single variable x. In this chapter we extend the notions of limits, continuity,
differentiation, and integration to multivariable functions, that is, functions
of several variables. These functions can be real-valued or possibly
vector-valued. More specifically, if $R^n$ denotes the n-dimensional Euclidean space,
$n \ge 1$, then we shall in general consider functions defined on a set $D \subset R^n$
and having values in $R^m$, $m \ge 1$. Such functions are represented symbolically as
$\mathbf{f}: D \to R^m$, where for $\mathbf{x} = (x_1, x_2, \ldots, x_n)' \in D$,

\[ \mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_m(\mathbf{x})]' \]

and $f_i(\mathbf{x})$ is a real-valued function of $x_1, x_2, \ldots, x_n$ (i = 1, 2, ..., m).

Even though the basic framework of the methodology in this chapter is 
general and applies in any number of dimensions, most of the examples are 
associated with two- or three-dimensional spaces. At this stage, it would be 
helpful to review the basic concepts given in Chapters 1 and 2. This can 
facilitate the understanding of the methodology and its development in a 
multidimensional environment. 


7.1. SOME BASIC DEFINITIONS 


Some of the concepts described in Chapter 1 pertained to one-dimensional
Euclidean spaces. In this section we extend these concepts to
higher-dimensional Euclidean spaces.

Any point $\mathbf{x}$ in $R^n$ can be represented as a column vector of the form
$(x_1, x_2, \ldots, x_n)'$, where $x_i$ is the ith element of $\mathbf{x}$ (i = 1, 2, ..., n). The
Euclidean norm of $\mathbf{x}$ was defined in Chapter 2 (see Definition 2.1.4) as
$\|\mathbf{x}\|_2 = (\sum_{i=1}^n x_i^2)^{1/2}$. For simplicity we shall drop the subindex 2 and denote
this norm by $\|\mathbf{x}\|$.

Let $\mathbf{x}_0 \in R^n$. A neighborhood $N_r(\mathbf{x}_0)$ of $\mathbf{x}_0$ is a set of points in $R^n$ that lie
within some distance, say r, from $\mathbf{x}_0$, that is,

\[ N_r(\mathbf{x}_0) = \{\mathbf{x} \in R^n \mid \|\mathbf{x} - \mathbf{x}_0\| < r\}. \]

If $\mathbf{x}_0$ is deleted from $N_r(\mathbf{x}_0)$, we obtain the so-called deleted neighborhood of
$\mathbf{x}_0$, which we denote by $N_r^d(\mathbf{x}_0)$.

A point $\mathbf{x}_0$ in $R^n$ is a limit point of a set $A \subset R^n$ if every neighborhood of
$\mathbf{x}_0$ contains an element $\mathbf{x}$ of A such that $\mathbf{x} \ne \mathbf{x}_0$, that is, every deleted
neighborhood of $\mathbf{x}_0$ contains points of A.

A set $A \subset R^n$ is closed if every limit point of A belongs to A.

A point $\mathbf{x}_0$ in $R^n$ is an interior point of a set $A \subset R^n$ if there exists an r > 0 such
that $N_r(\mathbf{x}_0) \subset A$.

A set $A \subset R^n$ is open if for every point $\mathbf{x}$ in A there exists a neighborhood
$N_r(\mathbf{x})$ that is contained in A. Thus A is open if it consists entirely of interior
points.

A point $\mathbf{p} \in R^n$ is a boundary point of a set $A \subset R^n$ if every neighborhood
of $\mathbf{p}$ contains points of A as well as points of $\bar{A}$, the complement of A with
respect to $R^n$. The set of all boundary points of A is called its boundary and
is denoted by Br(A).

A set $A \subset R^n$ is bounded if there exists an r > 0 such that $\|\mathbf{x}\| \le r$ for all $\mathbf{x}$
in A.

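Let $\mathbf{g}: J^+ \to R^n$ be a vector-valued function defined on the set of all
positive integers. Let $\mathbf{g}(i) = \mathbf{a}_i$, $i \ge 1$. Then $\{\mathbf{a}_i\}_{i=1}^{\infty}$ represents a sequence of
points in $R^n$. By a subsequence of $\{\mathbf{a}_i\}_{i=1}^{\infty}$ we mean a sequence $\{\mathbf{a}_{k_i}\}_{i=1}^{\infty}$ such
that $k_1 < k_2 < \cdots < k_i < \cdots$ and $k_i \ge i$ for $i \ge 1$ (see Definition 5.1.1).

A sequence $\{\mathbf{a}_i\}_{i=1}^{\infty}$ converges to a point $\mathbf{c} \in R^n$ if for a given $\epsilon > 0$ there
exists an integer N such that $\|\mathbf{a}_i - \mathbf{c}\| < \epsilon$ whenever i > N. This is written
symbolically as $\lim_{i\to\infty} \mathbf{a}_i = \mathbf{c}$, or $\mathbf{a}_i \to \mathbf{c}$ as $i \to \infty$.

A sequence $\{\mathbf{a}_i\}_{i=1}^{\infty}$ is bounded if there exists a number K > 0 such that
$\|\mathbf{a}_i\| \le K$ for all i.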


7.2. LIMITS OF A MULTIVARIABLE FUNCTION 


We recall from Chapter 3 that for a function of a single variable x, its limit at
a point is considered when x approaches the point from two directions, left
and right. Here, for a function of several variables, say $x_1, x_2, \ldots, x_n$, its limit
at a point $\mathbf{a} = (a_1, a_2, \ldots, a_n)'$ is considered when $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$
approaches $\mathbf{a}$ in any possible way. Thus when n > 1 there are infinitely many
ways in which $\mathbf{x}$ can approach $\mathbf{a}$.


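Definition 7.2.1. Let $\mathbf{f}: D \to R^m$, where $D \subset R^n$. Then $\mathbf{f}(\mathbf{x})$ is said to have
a limit $\mathbf{L} = (L_1, L_2, \ldots, L_m)'$ as $\mathbf{x}$ approaches $\mathbf{a}$, written symbolically as $\mathbf{x} \to \mathbf{a}$,
where $\mathbf{a}$ is a limit point of D, if for a given $\epsilon > 0$ there exists a $\delta > 0$ such
that $\|\mathbf{f}(\mathbf{x}) - \mathbf{L}\| < \epsilon$ for all $\mathbf{x}$ in $D \cap N_\delta^d(\mathbf{a})$, where $N_\delta^d(\mathbf{a})$ is a deleted
neighborhood of $\mathbf{a}$ of radius $\delta$. If it exists, this limit is written symbolically as
$\lim_{\mathbf{x}\to\mathbf{a}} \mathbf{f}(\mathbf{x}) = \mathbf{L}$. □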


Note that whenever a limit of f(x) exists, its value must be the same no 
matter how x approaches a. It is important here to understand the meaning 
of “x approaches a.” By this we do not necessarily mean that x moves along a 
straight line leading into a. Rather, we mean that x moves closer and closer 
to a along any curve that goes through a. 


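EXAMPLE 7.2.1. Consider the behavior of the function

\[ f(x_1, x_2) = \frac{x_1^3 - x_2^3}{x_1^2 + x_2^2} \]

as $\mathbf{x} = (x_1, x_2)' \to \mathbf{0}$, where $\mathbf{0} = (0, 0)'$. This function is defined everywhere in
$R^2$ except at $\mathbf{0}$. It is convenient here to represent the point $\mathbf{x}$ using polar
coordinates, r and $\theta$, such that $x_1 = r\cos\theta$, $x_2 = r\sin\theta$, $r > 0$, $0 \le \theta \le 2\pi$.
We then have

\[ f(x_1, x_2) = \frac{r^3\cos^3\theta - r^3\sin^3\theta}{r^2\cos^2\theta + r^2\sin^2\theta} = r(\cos^3\theta - \sin^3\theta). \]

Since $\mathbf{x} \to \mathbf{0}$ if and only if $r \to 0$, $\lim_{\mathbf{x}\to\mathbf{0}} f(x_1, x_2) = 0$ no matter how $\mathbf{x}$
approaches $\mathbf{0}$.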


EXAMPLE 7.2.2. Consider the function

\[ f(x_1, x_2) = \frac{x_1 x_2}{x_1^2 + x_2^2}. \]

Using polar coordinates again, we obtain

\[ f(x_1, x_2) = \cos\theta\,\sin\theta, \]

which depends on $\theta$, but not on r. Since $\theta$ can have infinitely many values,
$f(x_1, x_2)$ cannot be made close to any one constant L no matter how small r
is. Thus the limit of this function does not exist as $\mathbf{x} \to \mathbf{0}$.
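
As a numerical companion to Examples 7.2.1 and 7.2.2, the following minimal Python sketch samples each function on small circles around the origin. The helper name values_on_circle and the choice of eight sample angles are illustrative assumptions, not part of the text. When the limit exists, the sampled values should shrink toward one number as r decreases; for the second function the spread of the values should stay the same however small r is.

```python
import math

def f1(x1, x2):
    """Function of Example 7.2.1, defined away from the origin."""
    return (x1**3 - x2**3) / (x1**2 + x2**2)

def f2(x1, x2):
    """Function of Example 7.2.2, defined away from the origin."""
    return x1 * x2 / (x1**2 + x2**2)

def values_on_circle(f, r, n_angles=8):
    """Evaluate f at n_angles equally spaced points on the circle of radius r."""
    thetas = [2.0 * math.pi * k / n_angles for k in range(n_angles)]
    return [f(r * math.cos(t), r * math.sin(t)) for t in thetas]

for r in (1e-1, 1e-3, 1e-6):
    largest_f1 = max(abs(v) for v in values_on_circle(f1, r))
    vals_f2 = values_on_circle(f2, r)
    spread_f2 = max(vals_f2) - min(vals_f2)
    # largest_f1 shrinks in proportion to r; spread_f2 does not depend on r
    print(f"r = {r:g}: max |f1| = {largest_f1:.3e}, spread of f2 = {spread_f2:.3f}")
```

A finite sample of angles can only suggest, not prove, that a limit exists; the polar-coordinate argument in the examples is what settles the question.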
EXAMPLE 7.2.3. Let $f(x_1, x_2)$ be defined as

\[ f(x_1, x_2) = \frac{x_2(x_1^2 + x_2^2)}{x_2^2 + (x_1^2 + x_2^2)^2}. \]

This function is defined everywhere in $R^2$ except at (0, 0). On the line
$x_1 = 0$, $f(0, x_2) = x_2^3/(x_2^2 + x_2^4)$, which goes to zero as $x_2 \to 0$. When $x_2 = 0$,
$f(x_1, 0) = 0$ for $x_1 \ne 0$; hence, $f(x_1, 0) \to 0$ as $x_1 \to 0$. Furthermore, for any
other straight line $x_2 = t x_1$ ($t \ne 0$) through the origin we have

\[ f(x_1, t x_1) = \frac{t x_1 (x_1^2 + t^2 x_1^2)}{t^2 x_1^2 + (x_1^2 + t^2 x_1^2)^2} = \frac{t x_1 (1 + t^2)}{t^2 + x_1^2 (1 + t^2)^2}, \qquad x_1 \ne 0, \]

which has a limit equal to zero as $x_1 \to 0$. We conclude that the limit of
$f(x_1, x_2)$ is zero as $\mathbf{x} \to \mathbf{0}$ along any straight line through the origin. However,
$f(x_1, x_2)$ does not have a limit as $\mathbf{x} \to \mathbf{0}$. For example, along the circle
$x_2 = x_1^2 + x_2^2$ that passes through the origin,

\[ f(x_1, x_2) = \frac{(x_1^2 + x_2^2)^2}{(x_1^2 + x_2^2)^2 + (x_1^2 + x_2^2)^2} = \frac{1}{2}, \qquad \mathbf{x} \ne \mathbf{0}. \]

Hence, $f(x_1, x_2) \to \frac{1}{2} \ne 0$.
This example demonstrates that a function may not have a limit as $\mathbf{x} \to \mathbf{a}$
even though its limit exists for approaches toward $\mathbf{a}$ along straight lines.
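
The path dependence in Example 7.2.3 can also be observed numerically. The sketch below is only illustrative: the function names and the parametrization of the circle by an angle phi measured at its center (0, 1/2) are our own assumptions. It evaluates f along several straight lines and along the circle as the point moves toward the origin.

```python
import math

def f(x1, x2):
    """Function of Example 7.2.3, defined away from the origin."""
    s = x1**2 + x2**2
    return x2 * s / (x2**2 + s**2)

def along_line(t, x1):
    """Value of f on the straight line x2 = t * x1."""
    return f(x1, t * x1)

def along_circle(phi):
    """Value of f on the circle x2 = x1^2 + x2^2 (center (0, 1/2), radius 1/2).

    The point (sin(phi)/2, (1 - cos(phi))/2) tends to the origin as phi -> 0.
    """
    return f(0.5 * math.sin(phi), 0.5 * (1.0 - math.cos(phi)))

for step in (1e-1, 1e-2, 1e-3):
    line_values = [along_line(t, step) for t in (-2.0, 0.5, 1.0, 3.0)]
    # values on the lines shrink with step; the value on the circle stays near 1/2
    print(step, [round(v, 5) for v in line_values], round(along_circle(step), 5))
```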


7.3. CONTINUITY OF A MULTIVARIABLE FUNCTION 


The notion of continuity for a function of several variables is much the same 
as that for a function of a single variable. 


Definition 7.3.1. Let $\mathbf{f}: D \to R^m$, where $D \subset R^n$, and let $\mathbf{a} \in D$. Then $\mathbf{f}(\mathbf{x})$
is continuous at $\mathbf{a}$ if

\[ \lim_{\mathbf{x}\to\mathbf{a}} \mathbf{f}(\mathbf{x}) = \mathbf{f}(\mathbf{a}), \]

where $\mathbf{x}$ remains in D as it approaches $\mathbf{a}$. This is equivalent to stating that for
a given $\epsilon > 0$ there exists a $\delta > 0$ such that

\[ \|\mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{a})\| < \epsilon \]

for all $\mathbf{x} \in D \cap N_\delta(\mathbf{a})$.
If $\mathbf{f}(\mathbf{x})$ is continuous at every point $\mathbf{x}$ in D, then it is said to be continuous
in D. In particular, if $\mathbf{f}(\mathbf{x})$ is continuous in D and if $\delta$ (in the definition of
continuity) depends only on $\epsilon$ (that is, $\delta$ is the same for all points in D for
the given $\epsilon$), then $\mathbf{f}(\mathbf{x})$ is said to be uniformly continuous in D. □


We now present several theorems that provide some important properties 
of multivariable continuous functions. These theorems are analogous to those 
given in Chapter 3. Let us first consider the following lemmas (the proofs are 
left to the reader): 


Lemma 7.3.1. Every bounded sequence in $R^n$ has a convergent
subsequence.


This lemma is analogous to Theorem 5.1.4. 


Lemma 7.3.2. Suppose that $f, g: D \to R$ are real-valued continuous
functions, where $D \subset R^n$. Then we have the following:


1. $f + g$, $f - g$, and $fg$ are continuous in D.
2. $|f|$ is continuous in D.
3. $1/f$ is continuous in D provided that $f(\mathbf{x}) \ne 0$ for all $\mathbf{x}$ in D.


This lemma is analogous to Theorem 3.4.1.

Lemma 7.3.3. Suppose that $\mathbf{f}: D \to R^m$ is continuous, where $D \subset R^n$, and
that $\mathbf{g}: G \to R^p$ is also continuous, where $G \subset R^m$ is the image of D under $\mathbf{f}$.
Then the composite function $\mathbf{g} \circ \mathbf{f}: D \to R^p$, defined as $\mathbf{g} \circ \mathbf{f}(\mathbf{x}) = \mathbf{g}[\mathbf{f}(\mathbf{x})]$, is
continuous in D.

This lemma is analogous to Theorem 3.4.2. 

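Theorem 7.3.1. Let $f: D \to R$ be a real-valued continuous function
defined on a closed and bounded set $D \subset R^n$. Then there exist points $\mathbf{p}$ and $\mathbf{q}$
in D for which

\[ f(\mathbf{p}) = \sup_{\mathbf{x} \in D} f(\mathbf{x}), \tag{7.1} \]
\[ f(\mathbf{q}) = \inf_{\mathbf{x} \in D} f(\mathbf{x}). \tag{7.2} \]

Thus $f(\mathbf{x})$ attains each of its infimum and supremum at least once in D.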


Proof. Let us first show that $f(\mathbf{x})$ is bounded in D. We shall prove this by
contradiction. Suppose that $f(\mathbf{x})$ is not bounded in D. Then we can find a
sequence of points $\{\mathbf{p}_i\}_{i=1}^{\infty}$ in D such that $|f(\mathbf{p}_i)| \ge i$ for $i \ge 1$ and hence
$|f(\mathbf{p}_i)| \to \infty$ as $i \to \infty$. Since the terms of this sequence are elements in a
bounded set, $\{\mathbf{p}_i\}_{i=1}^{\infty}$ must be a bounded sequence. By Lemma 7.3.1, this
sequence has a convergent subsequence $\{\mathbf{p}_{k_i}\}_{i=1}^{\infty}$. Let $\mathbf{p}_0$ be the limit of this
subsequence, which is also a limit point of D; hence, it belongs to D, since D
is closed. Now, on one hand, $|f(\mathbf{p}_{k_i})| \to |f(\mathbf{p}_0)|$ as $i \to \infty$, by the continuity of
$f(\mathbf{x})$ and hence of $|f(\mathbf{x})|$ [see Lemma 7.3.2(2)]. On the other hand, $|f(\mathbf{p}_{k_i})| \to \infty$.
This contradiction shows that $f(\mathbf{x})$ must be bounded in D. Consequently, the
infimum and supremum of $f(\mathbf{x})$ in D are finite.

Suppose now equality (7.1) does not hold for any $\mathbf{p} \in D$. Then, $M - f(\mathbf{x}) > 0$
for all $\mathbf{x} \in D$, where $M = \sup_{\mathbf{x} \in D} f(\mathbf{x})$. Consequently, $[M - f(\mathbf{x})]^{-1}$ is positive
and continuous in D by Lemma 7.3.2(3) and is therefore bounded by the first
half of this proof. However, if $\delta > 0$ is any given positive number, then, by the
definition of M, we can find a point $\mathbf{x}_\delta$ in D for which $f(\mathbf{x}_\delta) > M - \delta$, or

\[ \frac{1}{M - f(\mathbf{x}_\delta)} > \frac{1}{\delta}. \]

This implies that $[M - f(\mathbf{x})]^{-1}$ is not bounded, a contradiction, which proves
equality (7.1). The proof of equality (7.2) is similar. □


Theorem 7.3.2. Suppose that D is a closed and bounded set in $R^n$. If
$\mathbf{f}: D \to R^m$ is continuous, then it is uniformly continuous in D.


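Proof. We shall prove this theorem by contradiction. Suppose that $\mathbf{f}$ is not
uniformly continuous in D. Then there exists an $\epsilon > 0$ such that for every
$\delta > 0$ we can find $\mathbf{a}$ and $\mathbf{b}$ in D such that $\|\mathbf{a} - \mathbf{b}\| < \delta$, but $\|\mathbf{f}(\mathbf{a}) - \mathbf{f}(\mathbf{b})\| \ge \epsilon$.
Let us choose $\delta = 1/i$, $i \ge 1$. We can therefore find two sequences
$\{\mathbf{a}_i\}_{i=1}^{\infty}$, $\{\mathbf{b}_i\}_{i=1}^{\infty}$ with $\mathbf{a}_i, \mathbf{b}_i \in D$ such that $\|\mathbf{a}_i - \mathbf{b}_i\| < 1/i$, and

\[ \|\mathbf{f}(\mathbf{a}_i) - \mathbf{f}(\mathbf{b}_i)\| \ge \epsilon \tag{7.3} \]

for $i \ge 1$. Now, the sequence $\{\mathbf{a}_i\}_{i=1}^{\infty}$ is bounded. Hence, by Lemma 7.3.1, it
has a convergent subsequence $\{\mathbf{a}_{k_i}\}_{i=1}^{\infty}$ whose limit, denoted by $\mathbf{a}_0$, is in D,
since D is closed. Also, since $\mathbf{f}$ is continuous at $\mathbf{a}_0$, we can find a $\lambda > 0$ such
that $\|\mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{a}_0)\| < \epsilon/2$ if $\|\mathbf{x} - \mathbf{a}_0\| < \lambda$, where $\mathbf{x} \in D$. By the convergence of
$\{\mathbf{a}_{k_i}\}_{i=1}^{\infty}$ to $\mathbf{a}_0$, we can choose $k_i$ large enough so that

\[ \frac{1}{k_i} < \frac{\lambda}{2} \tag{7.4} \]

and

\[ \|\mathbf{a}_{k_i} - \mathbf{a}_0\| < \frac{\lambda}{2}. \tag{7.5} \]

From (7.5) it follows that

\[ \|\mathbf{f}(\mathbf{a}_{k_i}) - \mathbf{f}(\mathbf{a}_0)\| < \frac{\epsilon}{2}. \tag{7.6} \]

Furthermore, since $\|\mathbf{a}_{k_i} - \mathbf{b}_{k_i}\| < 1/k_i$, we can write

\[ \|\mathbf{b}_{k_i} - \mathbf{a}_0\| \le \|\mathbf{a}_{k_i} - \mathbf{a}_0\| + \|\mathbf{a}_{k_i} - \mathbf{b}_{k_i}\| < \frac{\lambda}{2} + \frac{1}{k_i} < \lambda. \]

Hence, by the continuity of $\mathbf{f}$ at $\mathbf{a}_0$,

\[ \|\mathbf{f}(\mathbf{b}_{k_i}) - \mathbf{f}(\mathbf{a}_0)\| < \frac{\epsilon}{2}. \tag{7.7} \]

From inequalities (7.6) and (7.7) we conclude that whenever $k_i$ satisfies
inequalities (7.4) and (7.5),

\[ \|\mathbf{f}(\mathbf{a}_{k_i}) - \mathbf{f}(\mathbf{b}_{k_i})\| \le \|\mathbf{f}(\mathbf{a}_{k_i}) - \mathbf{f}(\mathbf{a}_0)\| + \|\mathbf{f}(\mathbf{b}_{k_i}) - \mathbf{f}(\mathbf{a}_0)\| < \epsilon, \]

which contradicts inequality (7.3). This leads us to assert that $\mathbf{f}$ is uniformly
continuous in D. □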


7.4. DERIVATIVES OF A MULTIVARIABLE FUNCTION 


In this section we generalize the concept of differentiation given in Chapter 4
to a multivariable function $\mathbf{f}: D \to R^m$, where $D \subset R^n$.

Let $\mathbf{a} = (a_1, a_2, \ldots, a_n)'$ be an interior point of D. Suppose that the limit

\[ \lim_{h_i \to 0} \frac{\mathbf{f}(a_1, a_2, \ldots, a_i + h_i, \ldots, a_n) - \mathbf{f}(a_1, a_2, \ldots, a_i, \ldots, a_n)}{h_i} \]

exists; then $\mathbf{f}$ is said to have a partial derivative with respect to $x_i$ at $\mathbf{a}$. This
derivative is denoted by $\partial\mathbf{f}(\mathbf{a})/\partial x_i$, or just $\mathbf{f}_{x_i}(\mathbf{a})$, i = 1, 2, ..., n. Hence, partial
differentiation with respect to $x_i$ is done in the usual fashion while treating
all the remaining variables as constants. For example, if $f: R^3 \to R$ is defined
as $f(x_1, x_2, x_3) = x_1 x_2^2 + x_2 x_3^3$, then at any point $\mathbf{x} \in R^3$ we have

\[ \frac{\partial f(\mathbf{x})}{\partial x_1} = x_2^2, \qquad \frac{\partial f(\mathbf{x})}{\partial x_2} = 2 x_1 x_2 + x_3^3, \qquad \frac{\partial f(\mathbf{x})}{\partial x_3} = 3 x_2 x_3^2. \]


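In general, if $f_j$ is the jth element of $\mathbf{f}$ (j = 1, 2, ..., m), then the terms
$\partial f_j(\mathbf{x})/\partial x_i$, for i = 1, 2, ..., n; j = 1, 2, ..., m, constitute an $m \times n$ matrix
called the Jacobian matrix (after Carl Gustav Jacobi, 1804-1851) of $\mathbf{f}$ at $\mathbf{x}$ and
is denoted by $\mathbf{J}_f(\mathbf{x})$. If m = n, the determinant of $\mathbf{J}_f(\mathbf{x})$ is called the Jacobian
determinant; it is sometimes represented as

\[ \det[\mathbf{J}_f(\mathbf{x})] = \frac{\partial(f_1, f_2, \ldots, f_n)}{\partial(x_1, x_2, \ldots, x_n)}. \tag{7.8} \]

For example, if $\mathbf{f}: R^3 \to R^2$ is such that

\[ \mathbf{f}(x_1, x_2, x_3) = (x_1^2 \cos x_2,\; x_2^2 + x_3^2 e^{x_1})', \]

then

\[ \mathbf{J}_f(x_1, x_2, x_3) = \begin{bmatrix} 2x_1 \cos x_2 & -x_1^2 \sin x_2 & 0 \\ x_3^2 e^{x_1} & 2x_2 & 2x_3 e^{x_1} \end{bmatrix}. \]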
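
A Jacobian matrix worked out by hand can be checked against difference quotients taken directly from the definition of a partial derivative. The following Python sketch does this for the map $(x_1^2 \cos x_2,\; x_2^2 + x_3^2 e^{x_1})'$; the function names, the step size h, and the evaluation point are illustrative assumptions.

```python
import math

def f(x):
    """The map R^3 -> R^2 given by (x1^2 cos x2, x2^2 + x3^2 e^{x1})."""
    x1, x2, x3 = x
    return [x1**2 * math.cos(x2), x2**2 + x3**2 * math.exp(x1)]

def jacobian_analytic(x):
    """The 2 x 3 Jacobian obtained by differentiating each element by hand."""
    x1, x2, x3 = x
    return [
        [2.0 * x1 * math.cos(x2), -x1**2 * math.sin(x2), 0.0],
        [x3**2 * math.exp(x1), 2.0 * x2, 2.0 * x3 * math.exp(x1)],
    ]

def jacobian_numeric(func, x, h=1e-6):
    """Approximate the m x n Jacobian by central differences of step h."""
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for i in range(n):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        fp, fm = func(xp), func(xm)
        for j in range(m):
            J[j][i] = (fp[j] - fm[j]) / (2.0 * h)  # row j, column i: d f_j / d x_i
    return J

point = [0.7, -1.2, 0.4]
for row_a, row_n in zip(jacobian_analytic(point), jacobian_numeric(f, point)):
    print([round(v, 6) for v in row_a], [round(v, 6) for v in row_n])
```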


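Higher-order partial derivatives of $\mathbf{f}$ are defined similarly. For example, the
second-order partial derivative of $\mathbf{f}$ with respect to $x_i$ at $\mathbf{a}$ is defined as

\[ \lim_{h_i \to 0} \frac{\mathbf{f}_{x_i}(a_1, a_2, \ldots, a_i + h_i, \ldots, a_n) - \mathbf{f}_{x_i}(a_1, a_2, \ldots, a_i, \ldots, a_n)}{h_i} \]

and is denoted by $\partial^2\mathbf{f}(\mathbf{a})/\partial x_i^2$, or $\mathbf{f}_{x_i x_i}(\mathbf{a})$. Also, the second-order partial
derivative of $\mathbf{f}$ with respect to $x_i$ and $x_j$, $i \ne j$, at $\mathbf{a}$ is given by

\[ \lim_{h_j \to 0} \frac{\mathbf{f}_{x_i}(a_1, a_2, \ldots, a_j + h_j, \ldots, a_n) - \mathbf{f}_{x_i}(a_1, a_2, \ldots, a_j, \ldots, a_n)}{h_j} \]

and is denoted by $\partial^2\mathbf{f}(\mathbf{a})/\partial x_j\,\partial x_i$, or $\mathbf{f}_{x_j x_i}(\mathbf{a})$, $i \ne j$.
Under certain conditions, the order in which differentiation with respect
to $x_i$ and $x_j$ takes place is irrelevant, that is, $\mathbf{f}_{x_i x_j}(\mathbf{a})$ is identical to $\mathbf{f}_{x_j x_i}(\mathbf{a})$,
$i \ne j$. This property is known as the commutative property of partial
differentiation and is proved in the next theorem.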


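Theorem 7.4.1. Let $\mathbf{f}: D \to R^m$, where $D \subset R^n$, and let $\mathbf{a}$ be an interior
point of D. Suppose that in a neighborhood of $\mathbf{a}$ the following conditions are
satisfied:


1. $\partial\mathbf{f}(\mathbf{x})/\partial x_i$ and $\partial\mathbf{f}(\mathbf{x})/\partial x_j$ exist and are finite (i, j = 1, 2, ..., n, $i \ne j$).
2. Of the derivatives $\partial^2\mathbf{f}(\mathbf{x})/\partial x_i\,\partial x_j$, $\partial^2\mathbf{f}(\mathbf{x})/\partial x_j\,\partial x_i$ one exists and is
continuous.


Then

\[ \frac{\partial^2 \mathbf{f}(\mathbf{a})}{\partial x_i\,\partial x_j} = \frac{\partial^2 \mathbf{f}(\mathbf{a})}{\partial x_j\,\partial x_i}. \]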

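Proof. Let us suppose that $\partial^2\mathbf{f}(\mathbf{x})/\partial x_j\,\partial x_i$ exists and is continuous in a
neighborhood of $\mathbf{a}$. Without loss of generality we assume that i < j. If
$\partial^2\mathbf{f}(\mathbf{a})/\partial x_i\,\partial x_j$ exists, then it must be equal to the limit

\[ \lim_{h_i \to 0} \frac{\mathbf{f}_{x_j}(a_1, a_2, \ldots, a_i + h_i, \ldots, a_n) - \mathbf{f}_{x_j}(a_1, a_2, \ldots, a_i, \ldots, a_n)}{h_i}, \]

that is,

\[ \lim_{h_i \to 0} \frac{1}{h_i} \Bigl\{ \lim_{h_j \to 0} \frac{1}{h_j} \bigl[ \mathbf{f}(a_1, \ldots, a_i + h_i, \ldots, a_j + h_j, \ldots, a_n) - \mathbf{f}(a_1, \ldots, a_i + h_i, \ldots, a_j, \ldots, a_n) \bigr] \]
\[ \qquad - \lim_{h_j \to 0} \frac{1}{h_j} \bigl[ \mathbf{f}(a_1, \ldots, a_i, \ldots, a_j + h_j, \ldots, a_n) - \mathbf{f}(a_1, \ldots, a_i, \ldots, a_j, \ldots, a_n) \bigr] \Bigr\}. \tag{7.9} \]

Let us denote $\mathbf{f}(x_1, x_2, \ldots, x_j + h_j, \ldots, x_n) - \mathbf{f}(x_1, x_2, \ldots, x_j, \ldots, x_n)$ by
$\boldsymbol{\psi}(x_1, x_2, \ldots, x_n)$. Then the double limit in (7.9) can be written as

\[ \lim_{h_i \to 0} \lim_{h_j \to 0} \frac{1}{h_i h_j} \bigl[ \boldsymbol{\psi}(a_1, \ldots, a_i + h_i, \ldots, a_j, \ldots, a_n) - \boldsymbol{\psi}(a_1, \ldots, a_i, \ldots, a_j, \ldots, a_n) \bigr] \]
\[ \qquad = \lim_{h_i \to 0} \lim_{h_j \to 0} \frac{1}{h_j}\, \frac{\partial \boldsymbol{\psi}(a_1, \ldots, a_i + \theta_i h_i, \ldots, a_j, \ldots, a_n)}{\partial x_i}, \tag{7.10} \]

where $0 < \theta_i < 1$. In formula (7.10) we have applied the mean value theorem
(Theorem 4.2.2) to $\boldsymbol{\psi}$ as if it were a function of the single variable $x_i$ (since
$\partial\mathbf{f}/\partial x_i$, and hence $\partial\boldsymbol{\psi}/\partial x_i$, exists in a neighborhood of $\mathbf{a}$). The right-hand
side of (7.10) can then be written as

\[ \lim_{h_i \to 0} \lim_{h_j \to 0} \frac{1}{h_j} \left[ \frac{\partial \mathbf{f}(a_1, \ldots, a_i + \theta_i h_i, \ldots, a_j + h_j, \ldots, a_n)}{\partial x_i} - \frac{\partial \mathbf{f}(a_1, \ldots, a_i + \theta_i h_i, \ldots, a_j, \ldots, a_n)}{\partial x_i} \right] \]
\[ \qquad = \lim_{h_i \to 0} \lim_{h_j \to 0} \frac{\partial}{\partial x_j} \left[ \frac{\partial \mathbf{f}(a_1, \ldots, a_i + \theta_i h_i, \ldots, a_j + \theta_j h_j, \ldots, a_n)}{\partial x_i} \right], \tag{7.11} \]

where $0 < \theta_j < 1$. In formula (7.11) we have again made use of the mean
value theorem, since $\partial^2\mathbf{f}(\mathbf{x})/\partial x_j\,\partial x_i$ exists in the given neighborhood around
$\mathbf{a}$. Furthermore, since $\partial^2\mathbf{f}(\mathbf{x})/\partial x_j\,\partial x_i$ is continuous in this neighborhood, the
double limit in (7.11) is equal to $\partial^2\mathbf{f}(\mathbf{a})/\partial x_j\,\partial x_i$. This establishes the
assertion that the two second-order partial derivatives of $\mathbf{f}$ are equal. □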


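EXAMPLE 7.4.1. Consider the function $f: R^3 \to R$, where $f(x_1, x_2, x_3) =
x_1 e^{x_2} + x_2 \cos x_1$. Then

\[ \frac{\partial f(x_1, x_2, x_3)}{\partial x_1} = e^{x_2} - x_2 \sin x_1, \]
\[ \frac{\partial f(x_1, x_2, x_3)}{\partial x_2} = x_1 e^{x_2} + \cos x_1, \]
\[ \frac{\partial^2 f(x_1, x_2, x_3)}{\partial x_2\,\partial x_1} = e^{x_2} - \sin x_1, \]
\[ \frac{\partial^2 f(x_1, x_2, x_3)}{\partial x_1\,\partial x_2} = e^{x_2} - \sin x_1. \]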
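
The equality of the two mixed derivatives in Example 7.4.1 can be checked numerically by differentiating each first-order partial derivative with respect to the other variable. The sketch below is illustrative only: it drops $x_3$ because the function does not depend on it, and the helper names, step size, and evaluation point are our own assumptions.

```python
import math

def f_x1(x1, x2):
    """First partial of f = x1*exp(x2) + x2*cos(x1) with respect to x1."""
    return math.exp(x2) - x2 * math.sin(x1)

def f_x2(x1, x2):
    """First partial of the same f with respect to x2."""
    return x1 * math.exp(x2) + math.cos(x1)

def d_dx1(g, x1, x2, h=1e-6):
    """Central-difference estimate of the partial of g with respect to x1."""
    return (g(x1 + h, x2) - g(x1 - h, x2)) / (2.0 * h)

def d_dx2(g, x1, x2, h=1e-6):
    """Central-difference estimate of the partial of g with respect to x2."""
    return (g(x1, x2 + h) - g(x1, x2 - h)) / (2.0 * h)

def mixed_closed_form(x1, x2):
    """The common value exp(x2) - sin(x1) of both mixed partials."""
    return math.exp(x2) - math.sin(x1)

a1, a2 = 0.8, -0.3
first_x1_then_x2 = d_dx2(f_x1, a1, a2)  # differentiate in x1, then in x2
first_x2_then_x1 = d_dx1(f_x2, a1, a2)  # differentiate in x2, then in x1
print(first_x1_then_x2, first_x2_then_x1, mixed_closed_form(a1, a2))
```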


7.4.1. The Total Derivative 


Let $f(\mathbf{x})$ be a real-valued function defined on a set $D \subset R^n$, where $\mathbf{x} =
(x_1, x_2, \ldots, x_n)'$. Suppose that $x_1, x_2, \ldots, x_n$ are functions of a single variable
t. Then f is a function of t. The ordinary derivative of f with respect to t,
namely df/dt, is called the total derivative of f.
Let us now assume that for the values of t under consideration $dx_i/dt$
exists for i = 1, 2, ..., n and that $\partial f(\mathbf{x})/\partial x_i$ exists and is continuous in the
interior of D for i = 1, 2, ..., n. Under these considerations, the total
derivative of f is given by

\[ \frac{df}{dt} = \sum_{i=1}^n \frac{\partial f(\mathbf{x})}{\partial x_i}\,\frac{dx_i}{dt}. \tag{7.12} \]

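To show this we proceed as follows: Let $\Delta x_1, \Delta x_2, \ldots, \Delta x_n$ be increments of
$x_1, x_2, \ldots, x_n$ that correspond to an increment $\Delta t$ of t. In turn, f will have
the increment $\Delta f$. We then have

\[ \Delta f = f(x_1 + \Delta x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n) - f(x_1, x_2, \ldots, x_n). \]

This can be written as

\[ \begin{aligned} \Delta f = {} & [f(x_1 + \Delta x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n) - f(x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n)] \\ & + [f(x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n) - f(x_1, x_2, x_3 + \Delta x_3, \ldots, x_n + \Delta x_n)] \\ & + [f(x_1, x_2, x_3 + \Delta x_3, \ldots, x_n + \Delta x_n) - f(x_1, x_2, x_3, x_4 + \Delta x_4, \ldots, x_n + \Delta x_n)] \\ & + \cdots + [f(x_1, x_2, \ldots, x_{n-1}, x_n + \Delta x_n) - f(x_1, x_2, \ldots, x_n)]. \end{aligned} \]

By applying the mean value theorem to the difference in each bracket we
obtain

\[ \begin{aligned} \Delta f = {} & \Delta x_1 \frac{\partial f(x_1 + \theta_1 \Delta x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n)}{\partial x_1} \\ & + \Delta x_2 \frac{\partial f(x_1, x_2 + \theta_2 \Delta x_2, x_3 + \Delta x_3, \ldots, x_n + \Delta x_n)}{\partial x_2} \\ & + \Delta x_3 \frac{\partial f(x_1, x_2, x_3 + \theta_3 \Delta x_3, x_4 + \Delta x_4, \ldots, x_n + \Delta x_n)}{\partial x_3} \\ & + \cdots + \Delta x_n \frac{\partial f(x_1, x_2, \ldots, x_{n-1}, x_n + \theta_n \Delta x_n)}{\partial x_n}, \end{aligned} \]

where $0 < \theta_i < 1$ for i = 1, 2, ..., n. Hence,

\[ \begin{aligned} \frac{\Delta f}{\Delta t} = {} & \frac{\Delta x_1}{\Delta t} \frac{\partial f(x_1 + \theta_1 \Delta x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n)}{\partial x_1} \\ & + \frac{\Delta x_2}{\Delta t} \frac{\partial f(x_1, x_2 + \theta_2 \Delta x_2, x_3 + \Delta x_3, \ldots, x_n + \Delta x_n)}{\partial x_2} \\ & + \frac{\Delta x_3}{\Delta t} \frac{\partial f(x_1, x_2, x_3 + \theta_3 \Delta x_3, x_4 + \Delta x_4, \ldots, x_n + \Delta x_n)}{\partial x_3} \\ & + \cdots + \frac{\Delta x_n}{\Delta t} \frac{\partial f(x_1, x_2, \ldots, x_{n-1}, x_n + \theta_n \Delta x_n)}{\partial x_n}. \end{aligned} \tag{7.13} \]

As $\Delta t \to 0$, $\Delta x_i/\Delta t \to dx_i/dt$, and the partial derivatives in (7.13), being
continuous, tend to $\partial f(\mathbf{x})/\partial x_i$ for i = 1, 2, ..., n. Thus $\Delta f/\Delta t$ tends to the
right-hand side of formula (7.12).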

For example, consider the function $f(x_1, x_2) = x_1^2 - x_2^3$, where $x_1 =
e^t \cos t$, $x_2 = \cos t + \sin t$. Then,

\[ \begin{aligned} \frac{df}{dt} &= 2x_1(e^t \cos t - e^t \sin t) - 3x_2^2(-\sin t + \cos t) \\ &= 2e^t \cos t\,(e^t \cos t - e^t \sin t) - 3(\cos t + \sin t)^2(-\sin t + \cos t) \\ &= (\cos t - \sin t)(2e^{2t} \cos t - 6 \sin t \cos t - 3). \end{aligned} \]

Of course, the same result could have been obtained by expressing f directly
as a function of t via $x_1$ and $x_2$ and then differentiating it with respect to t.
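
The agreement between formula (7.12) and direct differentiation in this example can be confirmed numerically. In the following illustrative Python sketch (function names, step size, and sample values of t are our own assumptions), the chain-rule expression, the factored closed form, and a central difference quotient of f regarded as a function of t are compared.

```python
import math

def x1(t):
    return math.exp(t) * math.cos(t)

def x2(t):
    return math.cos(t) + math.sin(t)

def f_of_t(t):
    """f = x1^2 - x2^3 expressed directly as a function of t."""
    return x1(t) ** 2 - x2(t) ** 3

def total_derivative_chain(t):
    """Right-hand side of (7.12): sum of (df/dx_i) * (dx_i/dt)."""
    dx1_dt = math.exp(t) * (math.cos(t) - math.sin(t))
    dx2_dt = -math.sin(t) + math.cos(t)
    return 2.0 * x1(t) * dx1_dt - 3.0 * x2(t) ** 2 * dx2_dt

def total_derivative_closed(t):
    """Factored form (cos t - sin t)(2 e^{2t} cos t - 6 sin t cos t - 3)."""
    return (math.cos(t) - math.sin(t)) * (
        2.0 * math.exp(2.0 * t) * math.cos(t) - 6.0 * math.sin(t) * math.cos(t) - 3.0
    )

def total_derivative_numeric(t, h=1e-6):
    """Central difference quotient of f_of_t."""
    return (f_of_t(t + h) - f_of_t(t - h)) / (2.0 * h)

for t in (0.0, 0.3, 1.1):
    print(t, total_derivative_chain(t), total_derivative_closed(t), total_derivative_numeric(t))
```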
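We can generalize formula (7.12) by assuming that each of $x_1, x_2, \ldots, x_n$ is
a function of several variables including the variable t. In this case, we need
to consider the partial derivative $\partial f/\partial t$, which can be similarly shown to have
the value

\[ \frac{\partial f}{\partial t} = \sum_{i=1}^n \frac{\partial f(\mathbf{x})}{\partial x_i}\,\frac{\partial x_i}{\partial t}. \tag{7.14} \]

In general, the expression

\[ df = \sum_{i=1}^n \frac{\partial f(\mathbf{x})}{\partial x_i}\,dx_i \tag{7.15} \]

is called the total differential of f at $\mathbf{x}$.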
EXAMPLE 7.4.2. Consider the equation $f(x_1, x_2) = 0$, which in general
represents a relation between $x_1$ and $x_2$. It may or may not define $x_2$ as a
function of $x_1$. In this case, $x_2$ is said to be an implicit function of $x_1$. If $x_2$
can be obtained as a function of $x_1$, then we write $x_2 = g(x_1)$. Consequently,
$f[x_1, g(x_1)]$ will be identically equal to zero. Hence, $f[x_1, g(x_1)]$, being a
function of one variable $x_1$, will have a total derivative identically equal to
zero. By applying formula (7.12) with $t = x_1$ we obtain

\[ \frac{df}{dx_1} = \frac{\partial f}{\partial x_1} + \frac{\partial f}{\partial x_2}\,\frac{dx_2}{dx_1} \equiv 0. \]

If $\partial f/\partial x_2 \ne 0$, then the derivative of $x_2$ is given by

\[ \frac{dx_2}{dx_1} = \frac{-\partial f/\partial x_1}{\partial f/\partial x_2}. \tag{7.16} \]

In particular, if $f(x_1, x_2) = 0$ is of the form $x_1 - h(x_2) = 0$, and if this
equation can be solved uniquely for $x_2$ in terms of $x_1$, then $x_2$ represents the
inverse function of h, that is, $x_2 = h^{-1}(x_1)$. Thus according to formula (7.16),

\[ \frac{dh^{-1}}{dx_1} = \frac{1}{dh/dx_2}. \]

This agrees with the formula for the derivative of the inverse function given
in Theorem 4.2.4.


7.4.2. Directional Derivatives 


Let $\mathbf{f}: D \to R^m$, where $D \subset R^n$, and let $\mathbf{v}$ be a unit vector in $R^n$ (that is, a
vector whose length is equal to one), which represents a certain direction in
the n-dimensional Euclidean space. By definition, the directional derivative
of $\mathbf{f}$ at a point $\mathbf{x}$ in the interior of D in the direction of $\mathbf{v}$ is given by the limit

\[ \lim_{h \to 0} \frac{\mathbf{f}(\mathbf{x} + h\mathbf{v}) - \mathbf{f}(\mathbf{x})}{h}, \]

if it exists. In particular, if $\mathbf{v} = \mathbf{e}_i$, the unit vector in the direction of the ith
coordinate axis, then the directional derivative of $\mathbf{f}$ in the direction of $\mathbf{v}$ is just
the partial derivative of $\mathbf{f}$ with respect to $x_i$ (i = 1, 2, ..., n).


Lemma 7.4.1. Let $\mathbf{f}: D \to R^m$, where $D \subset R^n$. If the partial derivatives
$\partial f_j/\partial x_i$ exist at a point $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ in the interior of D for i =
1, 2, ..., n; j = 1, 2, ..., m, where $f_j$ is the jth element of $\mathbf{f}$, then the
directional derivative of $\mathbf{f}$ at $\mathbf{x}$ in the direction of a unit vector $\mathbf{v}$ exists and is equal
to $\mathbf{J}_f(\mathbf{x})\mathbf{v}$, where $\mathbf{J}_f(\mathbf{x})$ is the $m \times n$ Jacobian of $\mathbf{f}$ at $\mathbf{x}$.
Proof. Let us first consider the directional derivative of $f_j$ in the direction
of $\mathbf{v}$. To do so, we rotate the coordinate axes so that $\mathbf{v}$ coincides with the
direction of the $\xi_1$-axis, where $\xi_1, \xi_2, \ldots, \xi_n$ are the resulting new
coordinates. By the well-known relations for rotation of axes in analytic geometry of
n dimensions we have

\[ x_i = \xi_1 v_i + \sum_{l=2}^n \xi_l \lambda_{li}, \qquad i = 1, 2, \ldots, n, \tag{7.17} \]

where $v_i$ is the ith element of $\mathbf{v}$ (i = 1, 2, ..., n) and $\lambda_{li}$ is the ith element of
$\boldsymbol{\lambda}_l$, the unit vector in the direction of the $\xi_l$-axis (l = 2, 3, ..., n).
Now, the directional derivative of $f_j$ in the direction of $\mathbf{v}$ can be obtained
by first expressing $f_j$ as a function of $\xi_1, \xi_2, \ldots, \xi_n$ using the relations
(7.17) and then differentiating it with respect to $\xi_1$. By formula (7.14), this is
equal to

\[ \frac{\partial f_j}{\partial \xi_1} = \sum_{i=1}^n \frac{\partial f_j}{\partial x_i}\,\frac{\partial x_i}{\partial \xi_1} = \sum_{i=1}^n \frac{\partial f_j}{\partial x_i}\,v_i, \qquad j = 1, 2, \ldots, m. \tag{7.18} \]

From formula (7.18) we conclude that the directional derivative of $\mathbf{f} =
(f_1, f_2, \ldots, f_m)'$ in the direction of $\mathbf{v}$ is equal to $\mathbf{J}_f(\mathbf{x})\mathbf{v}$. □


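EXAMPLE 7.4.3. Let $\mathbf{f}: R^3 \to R^2$ be defined as

\[ \mathbf{f}(x_1, x_2, x_3) = \begin{bmatrix} x_1^2 + x_2^2 + x_3^2 \\ x_1^2 - x_1 x_2 + x_3^2 \end{bmatrix}. \]

The directional derivative of $\mathbf{f}$ at $\mathbf{x} = (1, 2, 1)'$ in the direction of $\mathbf{v} = (1/\sqrt{2},
-1/\sqrt{2}, 0)'$ is

\[ \mathbf{J}_f(\mathbf{x})\mathbf{v} = \begin{bmatrix} 2x_1 & 2x_2 & 2x_3 \\ 2x_1 - x_2 & -x_1 & 2x_3 \end{bmatrix}_{(1,2,1)} \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \\ 0 \end{bmatrix} = \begin{bmatrix} 2 & 4 & 2 \\ 0 & -1 & 2 \end{bmatrix} \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \\ 0 \end{bmatrix} = \begin{bmatrix} -2/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}. \]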
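
The value of $\mathbf{J}_f(\mathbf{x})\mathbf{v}$ in Example 7.4.3 can be compared with the difference quotient that defines the directional derivative. The following is a minimal Python sketch under our own naming assumptions (mat_vec, directional_numeric, and the step h are illustrative).

```python
import math

def f(x):
    """The map of Example 7.4.3 from R^3 to R^2."""
    x1, x2, x3 = x
    return [x1**2 + x2**2 + x3**2, x1**2 - x1 * x2 + x3**2]

def jacobian(x):
    """The 2 x 3 Jacobian of f, differentiated by hand."""
    x1, x2, x3 = x
    return [[2.0 * x1, 2.0 * x2, 2.0 * x3],
            [2.0 * x1 - x2, -x1, 2.0 * x3]]

def mat_vec(A, v):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def directional_numeric(func, x, v, h=1e-6):
    """Central version of [f(x + h v) - f(x)] / h."""
    xp = [xi + h * vi for xi, vi in zip(x, v)]
    xm = [xi - h * vi for xi, vi in zip(x, v)]
    return [(p - m) / (2.0 * h) for p, m in zip(func(xp), func(xm))]

x = [1.0, 2.0, 1.0]
v = [1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0), 0.0]
exact = mat_vec(jacobian(x), v)          # equals (-2/sqrt(2), 1/sqrt(2))
approx = directional_numeric(f, x, v)
print(exact, approx)
```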
Definition 7.4.1. Let $f: D \to R$, where $D \subset R^n$. If the partial derivatives
$\partial f/\partial x_i$ (i = 1, 2, ..., n) exist at a point $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ in the interior of
D, then the vector $(\partial f/\partial x_1, \partial f/\partial x_2, \ldots, \partial f/\partial x_n)'$ is called the gradient of f
at $\mathbf{x}$ and is denoted by $\nabla f(\mathbf{x})$. □


Using Definition 7.4.1, the directional derivative of f at $\mathbf{x}$ in the direction
of a unit vector $\mathbf{v}$ can be expressed as $\nabla f(\mathbf{x})'\mathbf{v}$, where $\nabla f(\mathbf{x})'$ denotes the
transpose of $\nabla f(\mathbf{x})$.


The Geometric Meaning of the Gradient 


Let $f: D \to R$, where $D \subset R^n$. Suppose that the partial derivatives of f exist
at a point $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ in the interior of D. Let C denote a smooth
curve that lies on the surface of $f(\mathbf{x}) = c_0$, where $c_0$ is a constant, and passes
through the point $\mathbf{x}$. This curve can be represented by the equations $x_1 =
g_1(t)$, $x_2 = g_2(t)$, ..., $x_n = g_n(t)$, where $a \le t \le b$. By formula (7.12), the total
derivative of f with respect to t at $\mathbf{x}$ is

\[ \frac{df}{dt} = \sum_{i=1}^n \frac{\partial f(\mathbf{x})}{\partial x_i}\,\frac{dg_i}{dt}. \tag{7.19} \]

The vector $\boldsymbol{\lambda} = (dg_1/dt, dg_2/dt, \ldots, dg_n/dt)'$ is tangent to C at $\mathbf{x}$. Thus from
(7.19) we obtain

\[ \frac{df}{dt} = \nabla f(\mathbf{x})'\boldsymbol{\lambda}. \tag{7.20} \]

Now, since $f[g_1(t), g_2(t), \ldots, g_n(t)] = c_0$ along C, then df/dt = 0 and hence
$\nabla f(\mathbf{x})'\boldsymbol{\lambda} = 0$. This indicates that the gradient vector is orthogonal to $\boldsymbol{\lambda}$, and
hence to C, at $\mathbf{x} \in D$. Since this result is true for any smooth curve through $\mathbf{x}$,
we conclude that the gradient vector $\nabla f(\mathbf{x})$ is orthogonal to the surface of
$f(\mathbf{x}) = c_0$ at $\mathbf{x}$.


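Definition 7.4.2. Let $f: D \to R$, where $D \subset R^n$. Then $\nabla f: D \to R^n$. The
Jacobian matrix of $\nabla f(\mathbf{x})$ is called the Hessian matrix of f and is denoted by
$\mathbf{H}_f(\mathbf{x})$. Thus $\mathbf{H}_f(\mathbf{x}) = \mathbf{J}_{\nabla f}(\mathbf{x})$, that is,

\[ \mathbf{H}_f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \dfrac{\partial^2 f(\mathbf{x})}{\partial x_2\,\partial x_1} & \cdots & \dfrac{\partial^2 f(\mathbf{x})}{\partial x_n\,\partial x_1} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial^2 f(\mathbf{x})}{\partial x_1\,\partial x_n} & \dfrac{\partial^2 f(\mathbf{x})}{\partial x_2\,\partial x_n} & \cdots & \dfrac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix}. \tag{7.21} \]

The determinant of $\mathbf{H}_f(\mathbf{x})$ is called the Hessian determinant. If the conditions
of Theorem 7.4.1 regarding the commutative property of partial
differentiation are valid, then $\mathbf{H}_f(\mathbf{x})$ is a symmetric matrix. As we shall see in Section
7.7, the Hessian matrix plays an important role in the identification of
maxima and minima of a multivariable function. □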
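
To illustrate Definition 7.4.2, the sketch below forms a Hessian numerically as the Jacobian of a gradient and checks its symmetry. The function $f(x_1, x_2, x_3) = x_1 x_2^2 + x_2 x_3^3$, the helper names, the step size, and the evaluation point are all assumptions made for this illustration.

```python
def gradient(x):
    """Gradient of f(x1, x2, x3) = x1*x2^2 + x2*x3^3, differentiated by hand."""
    x1, x2, x3 = x
    return [x2**2, 2.0 * x1 * x2 + x3**3, 3.0 * x2 * x3**2]

def hessian_numeric(grad, x, h=1e-6):
    """Jacobian of the gradient by central differences.

    Entry (i, j) approximates the derivative with respect to x_j of df/dx_i.
    """
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp = list(x)
        xm = list(x)
        xp[j] += h
        xm[j] -= h
        gp, gm = grad(xp), grad(xm)
        for i in range(n):
            H[i][j] = (gp[i] - gm[i]) / (2.0 * h)
    return H

def hessian_closed(x):
    """Hessian of the same f obtained by differentiating the gradient by hand."""
    x1, x2, x3 = x
    return [[0.0, 2.0 * x2, 0.0],
            [2.0 * x2, 2.0 * x1, 3.0 * x3**2],
            [0.0, 3.0 * x3**2, 6.0 * x2 * x3]]

point = [0.5, -1.5, 2.0]
H = hessian_numeric(gradient, point)
for row in H:
    print([round(v, 5) for v in row])
```

Since this f has continuous second-order partial derivatives everywhere, the printed matrix should be symmetric up to rounding error.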


7.4.3. Differentiation of Composite Functions 


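Let $\mathbf{f}: D_1 \to R^m$, where $D_1 \subset R^n$, and let $\mathbf{g}: D_2 \to R^p$, where $D_2 \subset R^m$. Let $\mathbf{x}_0$
be an interior point of $D_1$ and $\mathbf{f}(\mathbf{x}_0)$ be an interior point of $D_2$. If the $m \times n$
Jacobian matrix $\mathbf{J}_f(\mathbf{x}_0)$ and the $p \times m$ Jacobian matrix $\mathbf{J}_g[\mathbf{f}(\mathbf{x}_0)]$ both exist,
then the $p \times n$ Jacobian matrix $\mathbf{J}_h(\mathbf{x}_0)$ for the composite function $\mathbf{h} = \mathbf{g} \circ \mathbf{f}$
exists and is given by

\[ \mathbf{J}_h(\mathbf{x}_0) = \mathbf{J}_g[\mathbf{f}(\mathbf{x}_0)]\,\mathbf{J}_f(\mathbf{x}_0). \tag{7.22} \]

To prove formula (7.22), let us consider the (k, i)th element of $\mathbf{J}_h(\mathbf{x}_0)$, namely
$\partial h_k(\mathbf{x}_0)/\partial x_i$, where $h_k(\mathbf{x}_0) = g_k[\mathbf{f}(\mathbf{x}_0)]$ is the kth element of $\mathbf{h}(\mathbf{x}_0) = \mathbf{g}[\mathbf{f}(\mathbf{x}_0)]$,
i = 1, 2, ..., n; k = 1, 2, ..., p. By applying formula (7.14) we obtain

\[ \frac{\partial h_k(\mathbf{x}_0)}{\partial x_i} = \sum_{j=1}^m \frac{\partial g_k[\mathbf{f}(\mathbf{x}_0)]}{\partial f_j}\,\frac{\partial f_j(\mathbf{x}_0)}{\partial x_i}, \qquad i = 1, 2, \ldots, n;\; k = 1, 2, \ldots, p, \tag{7.23} \]

where $f_j(\mathbf{x}_0)$ is the jth element of $\mathbf{f}(\mathbf{x}_0)$, j = 1, 2, ..., m. But $\partial g_k[\mathbf{f}(\mathbf{x}_0)]/\partial f_j$ is
the (k, j)th element of $\mathbf{J}_g[\mathbf{f}(\mathbf{x}_0)]$, and $\partial f_j(\mathbf{x}_0)/\partial x_i$ is the (j, i)th element of
$\mathbf{J}_f(\mathbf{x}_0)$, i = 1, 2, ..., n; j = 1, 2, ..., m; k = 1, 2, ..., p. Hence, formula (7.22)
follows from formula (7.23) and the rule of matrix multiplication.

In particular, if m = n = p, then from formula (7.22), the determinant of
$\mathbf{J}_h(\mathbf{x}_0)$ is given by

\[ \det[\mathbf{J}_h(\mathbf{x}_0)] = \det[\mathbf{J}_g(\mathbf{f}(\mathbf{x}_0))]\,\det[\mathbf{J}_f(\mathbf{x}_0)]. \tag{7.24} \]

Using the notation in formula (7.8), formula (7.24) can be expressed as

\[ \frac{\partial(h_1, h_2, \ldots, h_n)}{\partial(x_1, x_2, \ldots, x_n)} = \frac{\partial(g_1, g_2, \ldots, g_n)}{\partial(f_1, f_2, \ldots, f_n)}\;\frac{\partial(f_1, f_2, \ldots, f_n)}{\partial(x_1, x_2, \ldots, x_n)}. \tag{7.25} \]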


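EXAMPLE 7.4.4. Let $\mathbf{f}: R^2 \to R^3$ be given by

\[ \mathbf{f}(x_1, x_2) = \begin{bmatrix} x_1^2 - x_2 \cos x_1 \\ x_1 x_2 \\ x_1^3 + x_2^3 \end{bmatrix}. \]

Let $g: R^3 \to R$ be defined as

\[ g(\xi_1, \xi_2, \xi_3) = \xi_1 - \xi_2^2 + \xi_3, \]

where

\[ \xi_1 = x_1^2 - x_2 \cos x_1, \qquad \xi_2 = x_1 x_2, \qquad \xi_3 = x_1^3 + x_2^3. \]

In this case,

\[ \mathbf{J}_f(\mathbf{x}) = \begin{bmatrix} 2x_1 + x_2 \sin x_1 & -\cos x_1 \\ x_2 & x_1 \\ 3x_1^2 & 3x_2^2 \end{bmatrix}, \qquad \mathbf{J}_g[\mathbf{f}(\mathbf{x})] = (1, -2\xi_2, 1). \]

Hence, by formula (7.22),

\[ \mathbf{J}_h(\mathbf{x}) = (1, -2\xi_2, 1) \begin{bmatrix} 2x_1 + x_2 \sin x_1 & -\cos x_1 \\ x_2 & x_1 \\ 3x_1^2 & 3x_2^2 \end{bmatrix} = (2x_1 + x_2 \sin x_1 - 2x_1 x_2^2 + 3x_1^2,\; -\cos x_1 - 2x_1^2 x_2 + 3x_2^2). \]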
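
Formula (7.22) lends itself to a direct numerical check. The Python sketch below, written under our own naming assumptions, multiplies the two Jacobians of Example 7.4.4 and compares the product with a difference-quotient gradient of the composite function h = g∘f at an arbitrarily chosen point.

```python
import math

def f(x):
    """The map f of Example 7.4.4 from R^2 to R^3."""
    x1, x2 = x
    return [x1**2 - x2 * math.cos(x1), x1 * x2, x1**3 + x2**3]

def g(xi):
    """The function g of Example 7.4.4 from R^3 to R."""
    return xi[0] - xi[1] ** 2 + xi[2]

def jacobian_f(x):
    x1, x2 = x
    return [[2.0 * x1 + x2 * math.sin(x1), -math.cos(x1)],
            [x2, x1],
            [3.0 * x1**2, 3.0 * x2**2]]

def jacobian_g(xi):
    """The 1 x 3 Jacobian (1, -2*xi2, 1) of g."""
    return [1.0, -2.0 * xi[1], 1.0]

def jacobian_h_chain(x):
    """Row vector J_g[f(x)] J_f(x) from formula (7.22)."""
    Jg, Jf = jacobian_g(f(x)), jacobian_f(x)
    return [sum(Jg[j] * Jf[j][i] for j in range(3)) for i in range(2)]

def jacobian_h_numeric(x, step=1e-6):
    """Central-difference gradient of h(x) = g(f(x))."""
    out = []
    for i in range(2):
        xp = list(x)
        xm = list(x)
        xp[i] += step
        xm[i] -= step
        out.append((g(f(xp)) - g(f(xm))) / (2.0 * step))
    return out

point = [0.6, -1.1]
print(jacobian_h_chain(point), jacobian_h_numeric(point))
```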


7.5. TAYLOR’S THEOREM FOR A MULTIVARIABLE FUNCTION 


We shall now consider a multidimensional analogue of Taylor’s theorem, 
which was discussed in Section 4.3 for a single-variable function. 

Let us first introduce the following notation: Let x =(x,,%),...,%,)'. 
Then x'V denotes a first-order differential operator of the form 


The symbol V, called the del operator, was used earlier to define the 
gradient vector. If m is a positive integer, then (x'V)” denotes an mth-order 
differential operator. For example, for m =n = 2, 


yy? ð a \ 
' =|x +x 
(x'V) Vax, ax, 


ð? ð? a 
2 2 
=x +2x,x +x i 
1 ax? Max, 0x, óx? 
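Thus $(\mathbf{x}'\nabla)^2$ is obtained by squaring $x_1\,\partial/\partial x_1 + x_2\,\partial/\partial x_2$ in the usual
fashion, except that the squares of $\partial/\partial x_1$ and $\partial/\partial x_2$ are replaced by $\partial^2/\partial x_1^2$
and $\partial^2/\partial x_2^2$, respectively, and the product of $\partial/\partial x_1$ and $\partial/\partial x_2$ is replaced
by $\partial^2/\partial x_1\,\partial x_2$ (here we are assuming that the commutative property of
partial differentiation holds once these differential operators are applied to a
real-valued function). In general, $(\mathbf{x}'\nabla)^m$ is obtained by a multinomial expansion
of degree $m$ of the form

\[
(\mathbf{x}'\nabla)^m = \sum_{k_1, k_2, \ldots, k_n} \binom{m}{k_1, k_2, \ldots, k_n}
x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n}\,
\frac{\partial^m}{\partial x_1^{k_1}\,\partial x_2^{k_2} \cdots \partial x_n^{k_n}},
\]

where the sum is taken over all $n$-tuples $(k_1, k_2, \ldots, k_n)$ for which $\sum_{i=1}^{n} k_i = m$,
and

\[
\binom{m}{k_1, k_2, \ldots, k_n} = \frac{m!}{k_1!\,k_2! \cdots k_n!}.
\]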


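If a real-valued function $f(\mathbf{x})$ has partial derivatives through order $m$, then
an application of the differential operator $(\mathbf{x}'\nabla)^m$ to $f(\mathbf{x})$ results in

\[
(\mathbf{x}'\nabla)^m f(\mathbf{x}) = \sum_{k_1, k_2, \ldots, k_n} \binom{m}{k_1, k_2, \ldots, k_n}
x_1^{k_1} x_2^{k_2} \cdots x_n^{k_n}\,
\frac{\partial^m f(\mathbf{x})}{\partial x_1^{k_1}\,\partial x_2^{k_2} \cdots \partial x_n^{k_n}}.
\tag{7.26}
\]

The notation $(\mathbf{x}'\nabla)^m f(\mathbf{x}_0)$ indicates that $(\mathbf{x}'\nabla)^m f(\mathbf{x})$ is evaluated at $\mathbf{x}_0$.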


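Theorem 7.5.1. Let $f: D \to R$, where $D \subset R^n$, and let $N_\delta(\mathbf{x}_0)$ be a
neighborhood of $\mathbf{x}_0 \in D$ such that $N_\delta(\mathbf{x}_0) \subset D$. If $f$ and all its partial
derivatives of order $\le r$ exist and are continuous in $N_\delta(\mathbf{x}_0)$, then for any
$\mathbf{x} \in N_\delta(\mathbf{x}_0)$,

\[
f(\mathbf{x}) = f(\mathbf{x}_0) + \sum_{i=1}^{r-1} \frac{[(\mathbf{x} - \mathbf{x}_0)'\nabla]^i f(\mathbf{x}_0)}{i!}
+ \frac{[(\mathbf{x} - \mathbf{x}_0)'\nabla]^r f(\mathbf{z}_0)}{r!},
\tag{7.27}
\]

where $\mathbf{z}_0$ is a point on the line segment from $\mathbf{x}_0$ to $\mathbf{x}$.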


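Proof. Let $\mathbf{h} = \mathbf{x} - \mathbf{x}_0$. Let the function $\phi(t)$ be defined as $\phi(t) = f(\mathbf{x}_0 + t\mathbf{h})$,
where $0 \le t \le 1$. If $t = 0$, then $\phi(0) = f(\mathbf{x}_0)$, and $\phi(1) = f(\mathbf{x}_0 + \mathbf{h}) = f(\mathbf{x})$ if
$t = 1$. Now, by formula (7.12),

\[
\frac{d\phi(t)}{dt} = \sum_{i=1}^{n} h_i \left.\frac{\partial f(\mathbf{x})}{\partial x_i}\right|_{\mathbf{x} = \mathbf{x}_0 + t\mathbf{h}}
= (\mathbf{h}'\nabla) f(\mathbf{x}_0 + t\mathbf{h}),
\]

where $h_i$ is the $i$th element of $\mathbf{h}$ ($i = 1, 2, \ldots, n$). Furthermore, the derivative
of order $m$ of $\phi(t)$ is

\[
\frac{d^m\phi(t)}{dt^m} = (\mathbf{h}'\nabla)^m f(\mathbf{x}_0 + t\mathbf{h}), \qquad 1 \le m \le r.
\]

Since the partial derivatives of $f$ through order $r$ are continuous, then the
same order derivatives of $\phi(t)$ are also continuous on $[0, 1]$ and

\[
\left.\frac{d^m\phi(t)}{dt^m}\right|_{t=0} = (\mathbf{h}'\nabla)^m f(\mathbf{x}_0), \qquad 1 \le m \le r.
\]

If we now apply Taylor's theorem in Section 4.3 to the single-variable
function $\phi(t)$, we obtain

\[
\phi(t) = \phi(0) + \sum_{i=1}^{r-1} \frac{t^i}{i!} \left.\frac{d^i\phi(t)}{dt^i}\right|_{t=0}
+ \frac{t^r}{r!} \left.\frac{d^r\phi(t)}{dt^r}\right|_{t=\xi},
\tag{7.28}
\]

where $0 < \xi < t$. By setting $t = 1$ in formula (7.28), we obtain

\[
f(\mathbf{x}) = f(\mathbf{x}_0) + \sum_{i=1}^{r-1} \frac{[(\mathbf{x} - \mathbf{x}_0)'\nabla]^i f(\mathbf{x}_0)}{i!}
+ \frac{[(\mathbf{x} - \mathbf{x}_0)'\nabla]^r f(\mathbf{z}_0)}{r!},
\]

where $\mathbf{z}_0 = \mathbf{x}_0 + \xi\mathbf{h}$. Since $0 < \xi < 1$, the point $\mathbf{z}_0$ lies on the line segment
between $\mathbf{x}_0$ and $\mathbf{x}$. □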


In particular, if $f(\mathbf{x})$ has partial derivatives of all orders in $N_\delta(\mathbf{x}_0)$, then we
have the series expansion

\[
f(\mathbf{x}) = f(\mathbf{x}_0) + \sum_{i=1}^{\infty} \frac{[(\mathbf{x} - \mathbf{x}_0)'\nabla]^i f(\mathbf{x}_0)}{i!}.
\tag{7.29}
\]

In this case, the last term in formula (7.27) serves as a remainder of Taylor's
series.
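EXAMPLE 7.5.1. Consider the function $f: R^2 \to R$ defined as
$f(x_1, x_2) = x_1x_2 + x_1^2 + e^{x_1} \cos x_2$. This function has partial derivatives of all orders. Thus
in a neighborhood of $\mathbf{x}_0 = (0, 0)'$ we can write

\[
f(x_1, x_2) = 1 + (\mathbf{x}'\nabla) f(0, 0) + \frac{1}{2!} (\mathbf{x}'\nabla)^2 f(0, 0) + \frac{1}{3!} (\mathbf{x}'\nabla)^3 f(\xi x_1, \xi x_2),
\qquad 0 < \xi < 1.
\]

It can be verified that

\[
\begin{aligned}
(\mathbf{x}'\nabla) f(0, 0) &= x_1, \\
(\mathbf{x}'\nabla)^2 f(0, 0) &= 3x_1^2 + 2x_1x_2 - x_2^2, \\
(\mathbf{x}'\nabla)^3 f(\xi x_1, \xi x_2) &= x_1^3 e^{\xi x_1} \cos(\xi x_2) - 3x_1^2x_2\, e^{\xi x_1} \sin(\xi x_2) \\
&\quad - 3x_1x_2^2\, e^{\xi x_1} \cos(\xi x_2) + x_2^3 e^{\xi x_1} \sin(\xi x_2).
\end{aligned}
\]

Hence,

\[
f(x_1, x_2) = 1 + x_1 + \frac{1}{2!}\left(3x_1^2 + 2x_1x_2 - x_2^2\right)
+ \frac{1}{3!}\left\{\left[x_1^3 - 3x_1x_2^2\right] e^{\xi x_1} \cos(\xi x_2) + \left[x_2^3 - 3x_1^2x_2\right] e^{\xi x_1} \sin(\xi x_2)\right\}.
\]

The first three terms serve as a second-order approximation of $f(x_1, x_2)$,
while the last term serves as a remainder.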
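
The quality of the second-order approximation in Example 7.5.1 can be examined numerically. The Python sketch below is an illustration under stated assumptions: the helper names are hypothetical, and `remainder_bound` is a crude bound derived here from the remainder term by using $|\cos|, |\sin| \le 1$ and $e^{\xi x_1} \le e^{|x_1|}$, which gives $|R| \le e^{|x_1|}(|x_1| + |x_2|)^3/3!$.

```python
import math

def f(x1, x2):
    """The function of Example 7.5.1."""
    return x1 * x2 + x1**2 + math.exp(x1) * math.cos(x2)

def taylor2(x1, x2):
    """Second-order Taylor approximation about (0, 0)."""
    return 1 + x1 + 0.5 * (3 * x1**2 + 2 * x1 * x2 - x2**2)

def remainder_bound(x1, x2):
    """Crude upper bound on the absolute value of the third-order remainder."""
    return math.exp(abs(x1)) * (abs(x1) + abs(x2)) ** 3 / 6
```

For points close to the origin, `abs(f(x1, x2) - taylor2(x1, x2))` should not exceed `remainder_bound(x1, x2)`, and both shrink like the cube of the distance from the origin.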


7.6. INVERSE AND IMPLICIT FUNCTION THEOREMS 


Consider the function $\mathbf{f}: D \to R^n$, where $D \subset R^n$. Let $\mathbf{y} = \mathbf{f}(\mathbf{x})$. The purpose of
this section is to present conditions for the existence of an inverse function
$\mathbf{f}^{-1}$ which expresses $\mathbf{x}$ as a function of $\mathbf{y}$. These conditions are given in the
next theorem, whose proof can be found in Sagan (1974, page 371). See also
Fulks (1978, page 346).


Theorem 7.6.1 (Inverse Function Theorem). Let $\mathbf{f}: D \to R^n$, where $D$ is
an open subset of $R^n$ and $\mathbf{f}$ has continuous first-order partial derivatives
in $D$. If for some $\mathbf{x}_0 \in D$, the $n \times n$ Jacobian matrix $J_f(\mathbf{x}_0)$ is nonsingular,
that is,

\[
\det[J_f(\mathbf{x}_0)] = \left.\frac{\partial(f_1, f_2, \ldots, f_n)}{\partial(x_1, x_2, \ldots, x_n)}\right|_{\mathbf{x} = \mathbf{x}_0} \ne 0,
\]

where $f_i$ is the $i$th element of $\mathbf{f}$ ($i = 1, 2, \ldots, n$), then there exist an $\epsilon > 0$ and
a $\delta > 0$ such that an inverse function $\mathbf{f}^{-1}$ exists in the neighborhood $N_\delta[\mathbf{f}(\mathbf{x}_0)]$
and takes values in the neighborhood $N_\epsilon(\mathbf{x}_0)$. Moreover, $\mathbf{f}^{-1}$ has continuous
first-order partial derivatives in $N_\delta[\mathbf{f}(\mathbf{x}_0)]$, and its Jacobian matrix at $\mathbf{f}(\mathbf{x}_0)$ is
the inverse of $J_f(\mathbf{x}_0)$; hence,

\[
\det\{J_{f^{-1}}[\mathbf{f}(\mathbf{x}_0)]\} = \frac{1}{\det[J_f(\mathbf{x}_0)]}.
\tag{7.30}
\]
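EXAMPLE 7.6.1. Let $\mathbf{f}: R^3 \to R^3$ be given by

\[
\mathbf{f}(x_1, x_2, x_3) = \begin{bmatrix} 2x_1x_2 - x_2 \\ x_1^2 + x_2 + 2x_3^2 \\ x_1x_2 + x_2 \end{bmatrix}.
\]

Here,

\[
J_f(\mathbf{x}) = \begin{bmatrix} 2x_2 & 2x_1 - 1 & 0 \\ 2x_1 & 1 & 4x_3 \\ x_2 & x_1 + 1 & 0 \end{bmatrix},
\]

and $\det[J_f(\mathbf{x})] = -12x_2x_3$. Hence, at all $\mathbf{x} \in R^3$ for which $x_2x_3 \ne 0$, $\mathbf{f}$ has an
inverse function $\mathbf{f}^{-1}$. For example, if $D = \{(x_1, x_2, x_3) \mid x_2 > 0, x_3 > 0\}$, then $\mathbf{f}$ is
invertible in $D$. From the equations

\[
\begin{aligned}
y_1 &= 2x_1x_2 - x_2, \\
y_2 &= x_1^2 + x_2 + 2x_3^2, \\
y_3 &= x_1x_2 + x_2,
\end{aligned}
\]

we obtain the inverse function $\mathbf{x} = \mathbf{f}^{-1}(\mathbf{y})$, where

\[
\begin{aligned}
x_1 &= \frac{y_1 + y_3}{2y_3 - y_1}, \\
x_2 &= \frac{-y_1 + 2y_3}{3}, \\
x_3 &= \frac{1}{\sqrt{2}} \left[ y_2 - \frac{(y_1 + y_3)^2}{(2y_3 - y_1)^2} - \frac{2y_3 - y_1}{3} \right]^{1/2}.
\end{aligned}
\]

If, for example, we consider $\mathbf{x}_0 = (1, 1, 1)'$, then $\mathbf{y}_0 = \mathbf{f}(\mathbf{x}_0) = (1, 4, 2)'$, and
$\det[J_f(\mathbf{x}_0)] = -12$. The Jacobian matrix of $\mathbf{f}^{-1}$ at $\mathbf{y}_0$ is

\[
J_{f^{-1}}(\mathbf{y}_0) = \begin{bmatrix} \tfrac{2}{3} & 0 & -\tfrac{1}{3} \\ -\tfrac{1}{3} & 0 & \tfrac{2}{3} \\ -\tfrac{1}{4} & \tfrac{1}{4} & 0 \end{bmatrix}.
\]

Its determinant is equal to

\[
\det[J_{f^{-1}}(\mathbf{y}_0)] = -\frac{1}{12}.
\]

We note that this is the reciprocal of $\det[J_f(\mathbf{x}_0)]$, as it should be according to
formula (7.30).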
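
The computations of Example 7.6.1 can be reproduced numerically. The following Python sketch is illustrative only (the helper names and the finite-difference step are assumptions, not part of the original text). It encodes $\mathbf{f}$ and the explicit inverse on the region $x_2 > 0$, $x_3 > 0$, and estimates the Jacobian matrix of $\mathbf{f}^{-1}$ at $\mathbf{y}_0 = (1, 4, 2)'$ by central differences so that formula (7.30) can be checked.

```python
import math

def f(x1, x2, x3):
    """The map f: R^3 -> R^3 of Example 7.6.1."""
    return (2 * x1 * x2 - x2, x1**2 + x2 + 2 * x3**2, x1 * x2 + x2)

def f_inv(y1, y2, y3):
    """Explicit inverse of f on the region x2 > 0, x3 > 0."""
    x1 = (y1 + y3) / (2 * y3 - y1)
    x2 = (2 * y3 - y1) / 3
    x3 = math.sqrt((y2 - x1**2 - x2) / 2)
    return (x1, x2, x3)

def jacobian_f(x1, x2, x3):
    """Closed-form Jacobian matrix of f."""
    return [[2 * x2, 2 * x1 - 1, 0], [2 * x1, 1, 4 * x3], [x2, x1 + 1, 0]]

def det3(a):
    """Determinant of a 3 x 3 matrix by expansion along the first row."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

def numeric_jacobian(func, point, eps=1e-6):
    """Central-difference Jacobian; entry [i][j] approximates d func_i / d x_j."""
    n = len(point)
    cols = []
    for j in range(n):
        up, dn = list(point), list(point)
        up[j] += eps
        dn[j] -= eps
        fu, fd = func(*up), func(*dn)
        cols.append([(u - d) / (2 * eps) for u, d in zip(fu, fd)])
    return [[cols[j][i] for j in range(n)] for i in range(len(cols[0]))]
```

With these helpers, `det3(jacobian_f(1, 1, 1))` and `det3(numeric_jacobian(f_inv, (1.0, 4.0, 2.0)))` should be reciprocals of each other up to finite-difference error.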


The inverse function theorem can be viewed as providing a unique 
solution to a system of n equations given by y = f(x). There are, however, 
situations in which y is not explicitly expressed as a function of x. In general, 
we may have two vectors, x and y, of orders n X 1 and m X 1, respectively, 
that satisfy the relation 


\[
\mathbf{g}(\mathbf{x}, \mathbf{y}) = \mathbf{0},
\tag{7.31}
\]

where $\mathbf{g}: R^{m+n} \to R^n$. In this more general case, we have $n$ equations
involving m +n variables, namely, the elements of x and those of y. The 
question now is what conditions will allow us to solve equations (7.31) 
uniquely for x in terms of y. The answer to this question is given in the next 
theorem, whose proof can be found in Fulks (1978, page 352). 


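Theorem 7.6.2 (Implicit Function Theorem). Let $\mathbf{g}: D \to R^n$, where $D$ is
an open subset of $R^{m+n}$, and $\mathbf{g}$ has continuous first-order partial derivatives
in $D$. If there is a point $\mathbf{z}_0 \in D$, where $\mathbf{z}_0 = (\mathbf{x}_0', \mathbf{y}_0')'$ with $\mathbf{x}_0 \in R^n$, $\mathbf{y}_0 \in R^m$
such that $\mathbf{g}(\mathbf{z}_0) = \mathbf{0}$, and if at $\mathbf{z}_0$,

\[
\frac{\partial(g_1, g_2, \ldots, g_n)}{\partial(x_1, x_2, \ldots, x_n)} \ne 0,
\]

where $g_i$ is the $i$th element of $\mathbf{g}$ ($i = 1, 2, \ldots, n$), then there is a neighborhood
$N_\delta(\mathbf{y}_0)$ of $\mathbf{y}_0$ in which the equation $\mathbf{g}(\mathbf{x}, \mathbf{y}) = \mathbf{0}$ can be solved uniquely for $\mathbf{x}$ as a
continuously differentiable function of $\mathbf{y}$.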
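EXAMPLE 7.6.2. Let $\mathbf{g}: R^3 \to R^2$ be given by

\[
\mathbf{g}(x_1, x_2, y) = \begin{bmatrix} x_1 + x_2 + y^2 - 18 \\ x_1 - x_1x_2 + y - 4 \end{bmatrix}.
\]

We have

\[
\frac{\partial(g_1, g_2)}{\partial(x_1, x_2)} = \det \begin{bmatrix} 1 & 1 \\ 1 - x_2 & -x_1 \end{bmatrix} = x_2 - x_1 - 1.
\]

Let $\mathbf{z} = (x_1, x_2, y)'$. At the point $\mathbf{z}_0 = (1, 1, 4)'$, for example, $\mathbf{g}(\mathbf{z}_0) = \mathbf{0}$ and
$\partial(g_1, g_2)/\partial(x_1, x_2) = -1 \ne 0$. Hence, by Theorem 7.6.2, we can solve the
equations

\[
\begin{aligned}
x_1 + x_2 + y^2 - 18 &= 0, && (7.32) \\
x_1 - x_1x_2 + y - 4 &= 0 && (7.33)
\end{aligned}
\]

uniquely in terms of $y$ in some neighborhood of $y_0 = 4$. For example, if $D$ in
Theorem 7.6.2 is of the form

\[
D = \{(x_1, x_2, y) \mid x_1 > 0,\ x_2 > 0,\ y < 4.06\},
\]

then from equations (7.32) and (7.33) we obtain the solution

\[
\begin{aligned}
x_1 &= \tfrac{1}{2}\left\{-(y^2 - 17) + \left[(y^2 - 17)^2 - 4y + 16\right]^{1/2}\right\}, \\
x_2 &= \tfrac{1}{2}\left\{19 - y^2 - \left[(y^2 - 17)^2 - 4y + 16\right]^{1/2}\right\}.
\end{aligned}
\]

We note that the sign preceding the square root in the formula for $x_1$ was
chosen as $+$, so that $x_1 = x_2 = 1$ when $y = 4$. It can be verified that
$(y^2 - 17)^2 - 4y + 16$ is positive for $y < 4.06$.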
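
The explicit solution in Example 7.6.2 can be substituted back into equations (7.32) and (7.33) as a check. The Python sketch below is illustrative (the function names are hypothetical and not from the original text); it evaluates the residuals of $\mathbf{g}$ along the solution branch passing through $(1, 1, 4)'$.

```python
import math

def g(x1, x2, y):
    """Left-hand sides of equations (7.32) and (7.33)."""
    return (x1 + x2 + y**2 - 18, x1 - x1 * x2 + y - 4)

def solve_x(y):
    """Solution (x1, x2) as a function of y, using the + sign before the root."""
    root = math.sqrt((y**2 - 17) ** 2 - 4 * y + 16)
    x1 = 0.5 * (-(y**2 - 17) + root)
    x2 = 0.5 * (19 - y**2 - root)
    return x1, x2
```

For values of $y$ near 4 (and below 4.06, so that the square root is real), `g(*solve_x(y), y)` should be numerically zero in both components.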


7.7. OPTIMA OF A MULTIVARIABLE FUNCTION 


Let $f(\mathbf{x})$ be a real-valued function defined on a set $D \subset R^n$. A point $\mathbf{x}_0 \in D$ is
said to be a point of local maximum of $f$ if there exists a neighborhood
$N_\delta(\mathbf{x}_0) \subset D$ such that $f(\mathbf{x}) \le f(\mathbf{x}_0)$ for all $\mathbf{x} \in N_\delta(\mathbf{x}_0)$. If $f(\mathbf{x}) \ge f(\mathbf{x}_0)$ for all
$\mathbf{x} \in N_\delta(\mathbf{x}_0)$, then $\mathbf{x}_0$ is a point of local minimum. If one of these inequalities
holds for all $\mathbf{x}$ in $D$, then $\mathbf{x}_0$ is called a point of absolute maximum, or a point
of absolute minimum, respectively, of $f$ in $D$. In either case, $\mathbf{x}_0$ is referred to
as a point of optimum (or extremum), and the value of $f(\mathbf{x})$ at $\mathbf{x} = \mathbf{x}_0$ is called
an optimum value of $f(\mathbf{x})$.




In this section we shall discuss conditions under which f(x) attains local 
optima in D. Then, we shall investigate the determination of the optima of 
f(x) over a constrained region of D. 

As in the case of a single-variable function, if $f(\mathbf{x})$ has first-order partial
derivatives at a point $\mathbf{x}_0$ in the interior of $D$, and if $\mathbf{x}_0$ is a point of local
optimum, then $\partial f/\partial x_i = 0$ for $i = 1, 2, \ldots, n$ at $\mathbf{x}_0$. The proof of this fact is
similar to that of Theorem 4.4.1. Thus the vanishing of the first-order partial
derivatives of $f(\mathbf{x})$ at $\mathbf{x}_0$ is a necessary condition for a local optimum at $\mathbf{x}_0$,
but is obviously not sufficient. The first-order partial derivatives can be zero
without necessarily having a local optimum at $\mathbf{x}_0$.

In general, any point at which $\partial f/\partial x_i = 0$ for $i = 1, 2, \ldots, n$ is called a
stationary point. It follows that any point of local optimum at which $f$ has
first-order partial derivatives is a stationary point, but not every stationary
point is a point of local optimum. If no local optimum is attained at a
stationary point $\mathbf{x}_0$, then $\mathbf{x}_0$ is called a saddle point. The following theorem
gives the conditions needed to have a local optimum at a stationary point.


Theorem 7.7.1. Let $f: D \to R$, where $D \subset R^n$. Suppose that $f$ has continuous
second-order partial derivatives in $D$. If $\mathbf{x}_0$ is a stationary point of $f$,
then at $\mathbf{x}_0$ $f$ has the following:

i. A local minimum if $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0) > 0$ for all $\mathbf{h} = (h_1, h_2, \ldots, h_n)'$ in a
neighborhood of $\mathbf{0}$, where the elements of $\mathbf{h}$ are not all equal to zero.
ii. A local maximum if $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0) < 0$, where $\mathbf{h}$ is the same as in (i).
iii. A saddle point if $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0)$ changes sign for values of $\mathbf{h}$ in a
neighborhood of $\mathbf{0}$.


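Proof. By applying Taylor's theorem to $f(\mathbf{x})$ in a neighborhood of $\mathbf{x}_0$ we
obtain

\[
f(\mathbf{x}_0 + \mathbf{h}) = f(\mathbf{x}_0) + (\mathbf{h}'\nabla) f(\mathbf{x}_0) + \frac{1}{2!} (\mathbf{h}'\nabla)^2 f(\mathbf{z}_0),
\]

where $\mathbf{h}$ is a nonzero vector in a neighborhood of $\mathbf{0}$ and $\mathbf{z}_0$ is a point on the
line segment from $\mathbf{x}_0$ to $\mathbf{x}_0 + \mathbf{h}$. Since $\mathbf{x}_0$ is a stationary point, then
$(\mathbf{h}'\nabla) f(\mathbf{x}_0) = 0$. Hence,

\[
f(\mathbf{x}_0 + \mathbf{h}) - f(\mathbf{x}_0) = \frac{1}{2!} (\mathbf{h}'\nabla)^2 f(\mathbf{z}_0).
\]

Also, since the second-order partial derivatives of $f$ are continuous at $\mathbf{x}_0$,
then we can write

\[
f(\mathbf{x}_0 + \mathbf{h}) - f(\mathbf{x}_0) = \frac{1}{2!} (\mathbf{h}'\nabla)^2 f(\mathbf{x}_0) + o(\|\mathbf{h}\|),
\]

where $\|\mathbf{h}\| = (\mathbf{h}'\mathbf{h})^{1/2}$ and $o(\|\mathbf{h}\|) \to 0$ as $\mathbf{h} \to \mathbf{0}$. We note that for small values
of $\|\mathbf{h}\|$, the sign of $f(\mathbf{x}_0 + \mathbf{h}) - f(\mathbf{x}_0)$ depends on the value of $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0)$. It
follows that if

i. $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0) > 0$, then $f(\mathbf{x}_0 + \mathbf{h}) > f(\mathbf{x}_0)$ for all nonzero values of $\mathbf{h}$ in
some neighborhood of $\mathbf{0}$. Thus $\mathbf{x}_0$ is a point of local minimum of $f$.
ii. $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0) < 0$, then $f(\mathbf{x}_0 + \mathbf{h}) < f(\mathbf{x}_0)$ for all nonzero values of $\mathbf{h}$ in
some neighborhood of $\mathbf{0}$. In this case, $\mathbf{x}_0$ is a point of local maximum
of $f$.
iii. $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0)$ changes sign inside a neighborhood of $\mathbf{0}$, then $\mathbf{x}_0$ is neither
a point of local maximum nor a point of local minimum. Therefore, $\mathbf{x}_0$
must be a saddle point. □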


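We note that $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0)$ can be written as a quadratic form of the form
$\mathbf{h}'A\mathbf{h}$, where $A = H_f(\mathbf{x}_0)$ is the $n \times n$ Hessian matrix of $f$ evaluated at $\mathbf{x}_0$, that
is,

\[
A = \begin{bmatrix}
f_{11} & f_{12} & \cdots & f_{1n} \\
f_{21} & f_{22} & \cdots & f_{2n} \\
\vdots & \vdots & & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{nn}
\end{bmatrix},
\tag{7.34}
\]

where for simplicity we have denoted $\partial^2 f(\mathbf{x}_0)/\partial x_i\,\partial x_j$ by $f_{ij}$, $i, j = 1, 2, \ldots, n$
[see formula (7.21)].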


Corollary 7.7.1. Let $f$ be the same function as in Theorem 7.7.1, and let
$A$ be the matrix given by formula (7.34). If $\mathbf{x}_0$ is a stationary point of $f$, then
at $\mathbf{x}_0$ $f$ has the following:

i. A local minimum if $A$ is positive definite, that is, the leading principal
minors of $A$ (see Definition 2.3.6) are all positive,

\[
f_{11} > 0, \quad \det\begin{bmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{bmatrix} > 0, \quad \ldots, \quad \det(A) > 0.
\tag{7.35}
\]

ii. A local maximum if $A$ is negative definite, that is, the leading principal
minors of $A$ have alternating signs as follows:

\[
f_{11} < 0, \quad \det\begin{bmatrix} f_{11} & f_{12} \\ f_{21} & f_{22} \end{bmatrix} > 0, \quad \ldots, \quad (-1)^n \det(A) > 0.
\tag{7.36}
\]

iii. A saddle point if $A$ is neither positive definite nor negative definite.


Proof. 


i. By Theorem 7.7.1, $f$ has a local minimum at $\mathbf{x}_0$ if $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0) = \mathbf{h}'A\mathbf{h}$
is positive for all $\mathbf{h} \ne \mathbf{0}$, that is, if $A$ is positive definite. By Theorem
2.3.12(2), $A$ is positive definite if and only if its leading principal
minors are all positive. The conditions stated in (7.35) are therefore
sufficient for a local minimum at $\mathbf{x}_0$.

ii. $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0) < 0$ if and only if $A$ is negative definite, or $-A$ is positive
definite. Now, a leading principal minor of order $m$ ($= 1, 2, \ldots, n$) of
$-A$ is equal to $(-1)^m$ multiplied by the corresponding leading principal
minor of $A$. This leads to conditions (7.36).

iii. If $A$ is neither positive definite nor negative definite, then $(\mathbf{h}'\nabla)^2 f(\mathbf{x}_0)$
must change sign inside a neighborhood of $\mathbf{x}_0$. This makes $\mathbf{x}_0$ a saddle
point. □


A Special Case 


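If $f$ is a function of only $n = 2$ variables, $x_1$ and $x_2$, then conditions (7.35)
and (7.36) can be written as:

i. $f_{11} > 0$, $f_{11}f_{22} - f_{12}^2 > 0$ for a local minimum at $\mathbf{x}_0$.
ii. $f_{11} < 0$, $f_{11}f_{22} - f_{12}^2 > 0$ for a local maximum at $\mathbf{x}_0$.

If $f_{11}f_{22} - f_{12}^2 < 0$, then $\mathbf{x}_0$ is a saddle point, since in this case

\[
\mathbf{h}'A\mathbf{h} = h_1^2 \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1^2} + 2h_1h_2 \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1\,\partial x_2} + h_2^2 \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_2^2}
= \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1^2} (h_1 - ah_2)(h_1 - bh_2),
\]

where $ah_2$ and $bh_2$ are the real roots of the equation $\mathbf{h}'A\mathbf{h} = 0$ with respect
to $h_1$. Hence, $\mathbf{h}'A\mathbf{h}$ changes sign in a neighborhood of $\mathbf{0}$.

If $f_{11}f_{22} - f_{12}^2 = 0$, then $\mathbf{h}'A\mathbf{h}$ can be written as

\[
\mathbf{h}'A\mathbf{h} = \frac{\partial^2 f(\mathbf{x}_0)}{\partial x_1^2} \left[ h_1 + h_2 \frac{\partial^2 f(\mathbf{x}_0)/\partial x_1\,\partial x_2}{\partial^2 f(\mathbf{x}_0)/\partial x_1^2} \right]^2,
\]

provided that $\partial^2 f(\mathbf{x}_0)/\partial x_1^2 \ne 0$. Thus $\mathbf{h}'A\mathbf{h}$ has the same sign as that of
$\partial^2 f(\mathbf{x}_0)/\partial x_1^2$ except for those values of $\mathbf{h} = (h_1, h_2)'$ for which

\[
h_1 + h_2 \frac{\partial^2 f(\mathbf{x}_0)/\partial x_1\,\partial x_2}{\partial^2 f(\mathbf{x}_0)/\partial x_1^2} = 0,
\]

in which case it is zero. In the event $\partial^2 f(\mathbf{x}_0)/\partial x_1^2 = 0$, then $\partial^2 f(\mathbf{x}_0)/\partial x_1\,\partial x_2 = 0$,
and $\mathbf{h}'A\mathbf{h} = h_2^2\,\partial^2 f(\mathbf{x}_0)/\partial x_2^2$, which has the same sign as that of
$\partial^2 f(\mathbf{x}_0)/\partial x_2^2$, if it is different from zero, except for those values of
$\mathbf{h} = (h_1, h_2)'$ for which $h_2 = 0$, where it is zero. It follows that when $f_{11}f_{22} - f_{12}^2 = 0$,
$\mathbf{h}'A\mathbf{h}$ has a constant sign for all $\mathbf{h}$ inside a neighborhood of $\mathbf{0}$. However, it
can vanish for some nonzero values of $\mathbf{h}$. For such values of $\mathbf{h}$, the sign of
$f(\mathbf{x}_0 + \mathbf{h}) - f(\mathbf{x}_0)$ depends on the signs of higher-order partial derivatives
(higher than second order) of $f$ at $\mathbf{x}_0$. These can be obtained from Taylor's
expansion. In this case, no decision can be made regarding the nature of the
stationary point until these higher-order partial derivatives (if they exist) have
been investigated.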


EXAMPLE 7.7.1. Let $f: R^2 \to R$ be the function $f(x_1, x_2) = x_1^2 + 2x_2^2 - x_1$.
Consider the equations

\[
\frac{\partial f}{\partial x_1} = 2x_1 - 1 = 0, \qquad \frac{\partial f}{\partial x_2} = 4x_2 = 0.
\]

The only solution is $\mathbf{x}_0 = (0.5, 0)'$. The Hessian matrix is

\[
A = \begin{bmatrix} 2 & 0 \\ 0 & 4 \end{bmatrix},
\]

which is positive definite, since $2 > 0$ and $\det(A) = 8 > 0$. The point $\mathbf{x}_0$ is
therefore a local minimum. Since it is the only one in $R^2$, it must also be the
absolute minimum.


EXAMPLE 7.7.2. Consider the function $f: R^3 \to R$, where

\[
f(x_1, x_2, x_3) = \tfrac{1}{3}x_1^3 + 2x_2^2 + x_3^2 - 2x_1x_2 + 3x_1x_3 + x_2x_3 - 10x_1 + 4x_2 - 6x_3 + 1.
\]

A stationary point must satisfy the equations

\[
\begin{aligned}
\frac{\partial f}{\partial x_1} &= x_1^2 - 2x_2 + 3x_3 - 10 = 0, && (7.37) \\
\frac{\partial f}{\partial x_2} &= -2x_1 + 4x_2 + x_3 + 4 = 0, && (7.38) \\
\frac{\partial f}{\partial x_3} &= 3x_1 + x_2 + 2x_3 - 6 = 0. && (7.39)
\end{aligned}
\]

From (7.38) and (7.39) we get

\[
x_2 = x_1 - 2, \qquad x_3 = 4 - 2x_1.
\]

By substituting these expressions in equation (7.37) we obtain

\[
x_1^2 - 8x_1 + 6 = 0.
\]

This equation has two solutions, namely, $4 - \sqrt{10}$ and $4 + \sqrt{10}$. We therefore
have two stationary points,

\[
\begin{aligned}
\mathbf{x}^{(1)} &= (4 + \sqrt{10},\ 2 + \sqrt{10},\ -4 - 2\sqrt{10})', \\
\mathbf{x}^{(2)} &= (4 - \sqrt{10},\ 2 - \sqrt{10},\ -4 + 2\sqrt{10})'.
\end{aligned}
\]

Now, the Hessian matrix is

\[
A = \begin{bmatrix} 2x_1 & -2 & 3 \\ -2 & 4 & 1 \\ 3 & 1 & 2 \end{bmatrix}.
\]

Its leading principal minors are $2x_1$, $8x_1 - 4$, and $14x_1 - 56$. The last one is
the determinant of $A$. At $\mathbf{x}^{(1)}$ all three are positive. Therefore, $\mathbf{x}^{(1)}$ is a point
of local minimum. At $\mathbf{x}^{(2)}$ the values of the leading principal minors are 1.675,
2.7018, and $-44.272$. In this case, $A$ is neither positive definite nor negative
definite. Thus $\mathbf{x}^{(2)}$ is a saddle point.
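
The classification in Example 7.7.2 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, and `classify` simply applies the sign patterns (7.35) and (7.36) of Corollary 7.7.1 to the three leading principal minors, labeling every other sign pattern a saddle point as in part (iii) of that corollary.

```python
import math

def grad(x1, x2, x3):
    """Left-hand sides of equations (7.37), (7.38), and (7.39)."""
    return (x1**2 - 2 * x2 + 3 * x3 - 10,
            -2 * x1 + 4 * x2 + x3 + 4,
            3 * x1 + x2 + 2 * x3 - 6)

def stationary_points():
    """Both stationary points, from x1^2 - 8 x1 + 6 = 0, x2 = x1 - 2, x3 = 4 - 2 x1."""
    points = []
    for x1 in (4 + math.sqrt(10), 4 - math.sqrt(10)):
        points.append((x1, x1 - 2, 4 - 2 * x1))
    return points

def leading_minors(x1):
    """Leading principal minors of the Hessian matrix A as functions of x1."""
    return (2 * x1, 8 * x1 - 4, 14 * x1 - 56)

def classify(minors):
    """Apply the sign conditions (7.35) and (7.36) for n = 3."""
    d1, d2, d3 = minors
    if d1 > 0 and d2 > 0 and d3 > 0:
        return "local minimum"
    if d1 < 0 and d2 > 0 and d3 < 0:
        return "local maximum"
    return "saddle point"
```

Calling `classify(leading_minors(p[0]))` for each point `p` returned by `stationary_points()` gives the labels for $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$ in that order.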


7.8. THE METHOD OF LAGRANGE MULTIPLIERS 


This method, which is due to Joseph Louis de Lagrange (1736-1813), is used
to optimize a real-valued function $f(x_1, x_2, \ldots, x_n)$, where $x_1, x_2, \ldots, x_n$ are
subject to $m$ ($< n$) equality constraints of the form

\[
\begin{aligned}
g_1(x_1, x_2, \ldots, x_n) &= 0, \\
g_2(x_1, x_2, \ldots, x_n) &= 0, \\
&\ \,\vdots \\
g_m(x_1, x_2, \ldots, x_n) &= 0,
\end{aligned}
\tag{7.40}
\]

where $g_1, g_2, \ldots, g_m$ are differentiable functions.
The determination of the stationary points in this constrained optimization
problem is done by first considering the function

\[
F(\mathbf{x}) = f(\mathbf{x}) + \sum_{j=1}^{m} \lambda_j g_j(\mathbf{x}),
\tag{7.41}
\]

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ and $\lambda_1, \lambda_2, \ldots, \lambda_m$ are scalars called Lagrange
multipliers. By differentiating (7.41) with respect to $x_1, x_2, \ldots, x_n$ and equating
the partial derivatives to zero we obtain

\[
\frac{\partial F}{\partial x_i} = \frac{\partial f}{\partial x_i} + \sum_{j=1}^{m} \lambda_j \frac{\partial g_j}{\partial x_i} = 0, \qquad i = 1, 2, \ldots, n.
\tag{7.42}
\]

Equations (7.40) and (7.42) consist of $m + n$ equations in $m + n$ unknowns,
namely, $x_1, x_2, \ldots, x_n$; $\lambda_1, \lambda_2, \ldots, \lambda_m$. The solutions for $x_1, x_2, \ldots, x_n$ determine
the locations of the stationary points. The following argument explains
why this is the case:

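Suppose that in equation (7.40) we can solve for $m$ $x_i$'s, for example,
$x_1, x_2, \ldots, x_m$, in terms of the remaining $n - m$ variables. By Theorem 7.6.2,
this is possible whenever

\[
\frac{\partial(g_1, g_2, \ldots, g_m)}{\partial(x_1, x_2, \ldots, x_m)} \ne 0.
\tag{7.43}
\]

In this case, we can write

\[
\begin{aligned}
x_1 &= h_1(x_{m+1}, x_{m+2}, \ldots, x_n), \\
x_2 &= h_2(x_{m+1}, x_{m+2}, \ldots, x_n), \\
&\ \,\vdots \\
x_m &= h_m(x_{m+1}, x_{m+2}, \ldots, x_n).
\end{aligned}
\tag{7.44}
\]

Thus $f(\mathbf{x})$ is a function of only $n - m$ variables, namely, $x_{m+1}, x_{m+2}, \ldots, x_n$.
If the partial derivatives of $f$ with respect to these variables exist and if $f$ has
a local optimum, then these partial derivatives must necessarily vanish,
that is,

\[
\frac{\partial f}{\partial x_i} + \sum_{j=1}^{m} \frac{\partial f}{\partial h_j} \frac{\partial h_j}{\partial x_i} = 0, \qquad i = m + 1, m + 2, \ldots, n.
\tag{7.45}
\]

Now, if equations (7.44) are used to substitute $h_1, h_2, \ldots, h_m$ for $x_1, x_2, \ldots, x_m$,
respectively, in equation (7.40), then we obtain the identities

\[
\begin{aligned}
g_1(h_1, h_2, \ldots, h_m, x_{m+1}, x_{m+2}, \ldots, x_n) &\equiv 0, \\
g_2(h_1, h_2, \ldots, h_m, x_{m+1}, x_{m+2}, \ldots, x_n) &\equiv 0, \\
&\ \,\vdots \\
g_m(h_1, h_2, \ldots, h_m, x_{m+1}, x_{m+2}, \ldots, x_n) &\equiv 0.
\end{aligned}
\]

By differentiating these identities with respect to $x_{m+1}, x_{m+2}, \ldots, x_n$, we
obtain

\[
\frac{\partial g_k}{\partial x_i} + \sum_{j=1}^{m} \frac{\partial g_k}{\partial h_j} \frac{\partial h_j}{\partial x_i} = 0, \qquad i = m + 1, m + 2, \ldots, n;\ k = 1, 2, \ldots, m.
\tag{7.46}
\]

Let us now define the vectors

\[
\begin{aligned}
\boldsymbol{\delta}_k &= \left( \frac{\partial g_k}{\partial x_{m+1}}, \frac{\partial g_k}{\partial x_{m+2}}, \ldots, \frac{\partial g_k}{\partial x_n} \right)', & k &= 1, 2, \ldots, m, \\
\boldsymbol{\gamma}_k &= \left( \frac{\partial g_k}{\partial h_1}, \frac{\partial g_k}{\partial h_2}, \ldots, \frac{\partial g_k}{\partial h_m} \right)', & k &= 1, 2, \ldots, m, \\
\boldsymbol{\eta}_j &= \left( \frac{\partial h_j}{\partial x_{m+1}}, \frac{\partial h_j}{\partial x_{m+2}}, \ldots, \frac{\partial h_j}{\partial x_n} \right)', & j &= 1, 2, \ldots, m, \\
\boldsymbol{\psi} &= \left( \frac{\partial f}{\partial x_{m+1}}, \frac{\partial f}{\partial x_{m+2}}, \ldots, \frac{\partial f}{\partial x_n} \right)', \\
\boldsymbol{\tau} &= \left( \frac{\partial f}{\partial h_1}, \frac{\partial f}{\partial h_2}, \ldots, \frac{\partial f}{\partial h_m} \right)'.
\end{aligned}
\]

Equations (7.45) and (7.46) can then be written as

\[
[\boldsymbol{\delta}_1 : \boldsymbol{\delta}_2 : \cdots : \boldsymbol{\delta}_m] + [\boldsymbol{\eta}_1 : \boldsymbol{\eta}_2 : \cdots : \boldsymbol{\eta}_m]\,\Gamma = \mathbf{0},
\tag{7.47}
\]

\[
\boldsymbol{\psi} + [\boldsymbol{\eta}_1 : \boldsymbol{\eta}_2 : \cdots : \boldsymbol{\eta}_m]\,\boldsymbol{\tau} = \mathbf{0},
\tag{7.48}
\]

where $\Gamma = [\boldsymbol{\gamma}_1 : \boldsymbol{\gamma}_2 : \cdots : \boldsymbol{\gamma}_m]$, which is a nonsingular $m \times m$ matrix if condition
(7.43) is valid. From equation (7.47) we have

\[
[\boldsymbol{\eta}_1 : \boldsymbol{\eta}_2 : \cdots : \boldsymbol{\eta}_m] = -[\boldsymbol{\delta}_1 : \boldsymbol{\delta}_2 : \cdots : \boldsymbol{\delta}_m]\,\Gamma^{-1}.
\]

By making the proper substitution in equation (7.48) we obtain

\[
\boldsymbol{\psi} + [\boldsymbol{\delta}_1 : \boldsymbol{\delta}_2 : \cdots : \boldsymbol{\delta}_m]\,\boldsymbol{\lambda} = \mathbf{0},
\tag{7.49}
\]

where

\[
\boldsymbol{\lambda} = -\Gamma^{-1}\boldsymbol{\tau}.
\tag{7.50}
\]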
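Equations (7.49) can then be expressed as

\[
\frac{\partial f}{\partial x_i} + \sum_{j=1}^{m} \lambda_j \frac{\partial g_j}{\partial x_i} = 0, \qquad i = m + 1, m + 2, \ldots, n.
\tag{7.51}
\]

Similarly, equation (7.50), written as $\Gamma\boldsymbol{\lambda} + \boldsymbol{\tau} = \mathbf{0}$, gives

\[
\frac{\partial f}{\partial x_i} + \sum_{j=1}^{m} \lambda_j \frac{\partial g_j}{\partial x_i} = 0, \qquad i = 1, 2, \ldots, m.
\tag{7.52}
\]

Equations (7.51) and (7.52) can now be combined into a single vector
equation of the form

\[
\nabla f(\mathbf{x}) + \sum_{j=1}^{m} \lambda_j \nabla g_j = \mathbf{0},
\]

which is the same as equation (7.42). We conclude that at a stationary point
of $f$, the values of $x_1, x_2, \ldots, x_n$ and the corresponding values of $\lambda_1, \lambda_2, \ldots, \lambda_m$
must satisfy equations (7.40) and (7.42).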


Sufficient Conditions for a Local Optimum 
in the Method of Lagrange Multipliers 


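Equations (7.42) are only necessary for a stationary point $\mathbf{x}_0$ to be a point of
local optimum of $f$ subject to the constraints given by equations (7.40).
Sufficient conditions for a local optimum are given in Gillespie (1954, pages
97-98). The following is a reproduction of these conditions:

Let $\mathbf{x}_0$ be a stationary point of $f$ whose coordinates satisfy equations (7.40)
and (7.42), and let $\lambda_1, \lambda_2, \ldots, \lambda_m$ be the corresponding Lagrange multipliers.
Let $F_{ij}$ denote the second-order partial derivative of $F$ in formula (7.41) with
respect to $x_i$ and $x_j$, $i, j = 1, 2, \ldots, n$; $i \ne j$. Consider the $(m + n) \times (m + n)$
matrix

\[
B_1 = \begin{bmatrix}
F_{11} & F_{12} & \cdots & F_{1n} & g_1^{(1)} & g_1^{(2)} & \cdots & g_1^{(m)} \\
F_{21} & F_{22} & \cdots & F_{2n} & g_2^{(1)} & g_2^{(2)} & \cdots & g_2^{(m)} \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
F_{n1} & F_{n2} & \cdots & F_{nn} & g_n^{(1)} & g_n^{(2)} & \cdots & g_n^{(m)} \\
g_1^{(1)} & g_2^{(1)} & \cdots & g_n^{(1)} & 0 & 0 & \cdots & 0 \\
g_1^{(2)} & g_2^{(2)} & \cdots & g_n^{(2)} & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\
g_1^{(m)} & g_2^{(m)} & \cdots & g_n^{(m)} & 0 & 0 & \cdots & 0
\end{bmatrix},
\tag{7.53}
\]

where $g_i^{(j)} = \partial g_j/\partial x_i$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$. Let $\Delta_1$ denote the determinant
of $B_1$. Furthermore, let $\Delta_2, \Delta_3, \ldots, \Delta_{n-m}$ denote a set of principal
minors of $B_1$ (see Definition 2.3.6), namely, the determinants of the principal
submatrices $B_2, B_3, \ldots, B_{n-m}$, where $B_i$ is obtained by deleting the first $i - 1$
rows and the first $i - 1$ columns of $B_1$ ($i = 2, 3, \ldots, n - m$). All the partial
derivatives used in $B_1, B_2, \ldots, B_{n-m}$ are evaluated at $\mathbf{x}_0$. Then sufficient
conditions for $\mathbf{x}_0$ to be a point of local minimum of $f$ are the following:

i. If $m$ is even,

\[
\Delta_1 > 0, \quad \Delta_2 > 0, \quad \ldots, \quad \Delta_{n-m} > 0.
\]

ii. If $m$ is odd,

\[
\Delta_1 < 0, \quad \Delta_2 < 0, \quad \ldots, \quad \Delta_{n-m} < 0.
\]

However, sufficient conditions for $\mathbf{x}_0$ to be a point of local maximum are the
following:

i. If $n$ is even,

\[
\Delta_1 > 0, \quad \Delta_2 < 0, \quad \ldots, \quad (-1)^{n-m} \Delta_{n-m} < 0.
\]

ii. If $n$ is odd,

\[
\Delta_1 < 0, \quad \Delta_2 > 0, \quad \ldots, \quad (-1)^{n-m} \Delta_{n-m} > 0.
\]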


EXAMPLE 7.8.1. Let us find the minimum and maximum distances from
the origin to the curve determined by the intersection of the plane $x_2 + x_3 = 0$
with the ellipsoid $x_1^2 + 2x_2^2 + x_3^2 + 2x_2x_3 = 1$. Let $f(x_1, x_2, x_3)$ be the squared
distance function from the origin, that is,

\[
f(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2.
\]

The equality constraints are

\[
\begin{aligned}
g_1(x_1, x_2, x_3) &= x_2 + x_3 = 0, \\
g_2(x_1, x_2, x_3) &= x_1^2 + 2x_2^2 + x_3^2 + 2x_2x_3 - 1 = 0.
\end{aligned}
\]

Then

\[
F(x_1, x_2, x_3) = x_1^2 + x_2^2 + x_3^2 + \lambda_1(x_2 + x_3) + \lambda_2(x_1^2 + 2x_2^2 + x_3^2 + 2x_2x_3 - 1),
\]

\[
\begin{aligned}
\frac{\partial F}{\partial x_1} &= 2x_1 + 2\lambda_2x_1 = 0, && (7.54) \\
\frac{\partial F}{\partial x_2} &= 2x_2 + \lambda_1 + 2\lambda_2(2x_2 + x_3) = 0, && (7.55) \\
\frac{\partial F}{\partial x_3} &= 2x_3 + \lambda_1 + 2\lambda_2(x_2 + x_3) = 0. && (7.56)
\end{aligned}
\]

Equations (7.54), (7.55), and (7.56) and the equality constraints are satisfied
by the following sets of solutions:

I. $x_1 = 0$, $x_2 = 1$, $x_3 = -1$, $\lambda_1 = 2$, $\lambda_2 = -2$.
II. $x_1 = 0$, $x_2 = -1$, $x_3 = 1$, $\lambda_1 = -2$, $\lambda_2 = -2$.
III. $x_1 = 1$, $x_2 = 0$, $x_3 = 0$, $\lambda_1 = 0$, $\lambda_2 = -1$.
IV. $x_1 = -1$, $x_2 = 0$, $x_3 = 0$, $\lambda_1 = 0$, $\lambda_2 = -1$.

To determine if any of these four sets correspond to local maxima or
minima, we need to examine the values of $\Delta_1, \Delta_2, \ldots, \Delta_{n-m}$. Here, the matrix
$B_1$ in formula (7.53) has the value

\[
B_1 = \begin{bmatrix}
2 + 2\lambda_2 & 0 & 0 & 0 & 2x_1 \\
0 & 2 + 4\lambda_2 & 2\lambda_2 & 1 & 4x_2 + 2x_3 \\
0 & 2\lambda_2 & 2 + 2\lambda_2 & 1 & 2x_2 + 2x_3 \\
0 & 1 & 1 & 0 & 0 \\
2x_1 & 4x_2 + 2x_3 & 2x_2 + 2x_3 & 0 & 0
\end{bmatrix}.
\]

Since $n = 3$ and $m = 2$, only one $\Delta_i$, namely, $\Delta_1$, the determinant of $B_1$, is
needed. Furthermore, since $m$ is even and $n$ is odd, a sufficient condition for
a local minimum is $\Delta_1 > 0$, and for a local maximum the condition is $\Delta_1 < 0$.
It can be verified that $\Delta_1 = -8$ for solution sets I and II, and $\Delta_1 = 8$ for
solution sets III and IV. We therefore have local maxima at the points
$(0, 1, -1)$ and $(0, -1, 1)$ with a common maximum value $f_{\max} = 2$. We also
have local minima at the points $(1, 0, 0)$ and $(-1, 0, 0)$ with a common
minimum value $f_{\min} = 1$. Since these are the only local optima on the curve of
intersection, we conclude that the minimum distance from the origin to this
curve is 1 and the maximum distance is $\sqrt{2}$.
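
The determinant values quoted in Example 7.8.1 can be verified directly. The Python sketch below is illustrative: the function names are hypothetical, and the multipliers used for the point $(0, 1, -1)$, namely $\lambda_1 = 2$ and $\lambda_2 = -2$, are obtained here by solving equations (7.55) and (7.56) at that point. The code checks that each solution set satisfies equations (7.54) to (7.56) and the two constraints, and evaluates $\Delta_1 = \det(B_1)$.

```python
def det(m):
    """Determinant by cofactor expansion along the first row (adequate for 5 x 5)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        if a == 0:
            continue
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def bordered_matrix(x1, x2, x3, lam1, lam2):
    """The matrix B_1 of formula (7.53) for Example 7.8.1 (lam1 does not enter)."""
    return [
        [2 + 2 * lam2, 0, 0, 0, 2 * x1],
        [0, 2 + 4 * lam2, 2 * lam2, 1, 4 * x2 + 2 * x3],
        [0, 2 * lam2, 2 + 2 * lam2, 1, 2 * x2 + 2 * x3],
        [0, 1, 1, 0, 0],
        [2 * x1, 4 * x2 + 2 * x3, 2 * x2 + 2 * x3, 0, 0],
    ]

def lagrange_equations(x1, x2, x3, lam1, lam2):
    """Left-hand sides of (7.54), (7.55), (7.56) and of the two constraints."""
    return (
        2 * x1 + 2 * lam2 * x1,
        2 * x2 + lam1 + 2 * lam2 * (2 * x2 + x3),
        2 * x3 + lam1 + 2 * lam2 * (x2 + x3),
        x2 + x3,
        x1**2 + 2 * x2**2 + x3**2 + 2 * x2 * x3 - 1,
    )

# Solution sets as (x1, x2, x3, lambda1, lambda2).
SOLUTIONS = {
    "I": (0, 1, -1, 2, -2),
    "II": (0, -1, 1, -2, -2),
    "III": (1, 0, 0, 0, -1),
    "IV": (-1, 0, 0, 0, -1),
}
```

For each entry `s` of `SOLUTIONS`, `lagrange_equations(*s)` should vanish identically and `det(bordered_matrix(*s))` gives the value of $\Delta_1$ used to classify the point.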


7.9. THE RIEMANN INTEGRAL OF A MULTIVARIABLE FUNCTION 


In Chapter 6 we discussed the Riemann integral of a real-valued function of
a single variable $x$. In this section we extend the concept of Riemann
integration to real-valued functions of $n$ variables, $x_1, x_2, \ldots, x_n$.

Definition 7.9.1. The set of points in $R^n$ whose coordinates satisfy the
inequalities

\[
a_i \le x_i \le b_i, \qquad i = 1, 2, \ldots, n,
\tag{7.57}
\]

where $a_i < b_i$, $i = 1, 2, \ldots, n$, form an $n$-dimensional cell denoted by $c_n(\mathbf{a}, \mathbf{b})$.
The content (or volume) of this cell is $\prod_{i=1}^{n} (b_i - a_i)$ and is denoted by
$\mu[c_n(\mathbf{a}, \mathbf{b})]$.

Suppose that $P_i$ is a partition of the interval $[a_i, b_i]$, $i = 1, 2, \ldots, n$. The
Cartesian product $P = \times_{i=1}^{n} P_i$ is a partition of $c_n(\mathbf{a}, \mathbf{b})$ and consists of
$n$-dimensional subcells of $c_n(\mathbf{a}, \mathbf{b})$. We denote these subcells by $S_1, S_2, \ldots, S_\nu$.
The content of $S_i$ is denoted by $\mu(S_i)$, $i = 1, 2, \ldots, \nu$, where $\nu$ is the number
of subcells. □


We shall first define the Riemann integral of a real-valued function f(x) 
on an n-dimensional cell; then we shall extend this definition to any bounded 
region in R”. 


7.9.1. The Riemann Integral on Cells 


Let $f: D \to R$, where $D \subset R^n$. Suppose that $c_n(\mathbf{a}, \mathbf{b})$ is an $n$-dimensional cell
contained in $D$ and that $f$ is bounded on $c_n(\mathbf{a}, \mathbf{b})$. Let $P$ be a partition of
$c_n(\mathbf{a}, \mathbf{b})$ consisting of the subcells $S_1, S_2, \ldots, S_\nu$. Let $m_i$ and $M_i$ be, respectively,
the infimum and supremum of $f$ on $S_i$, $i = 1, 2, \ldots, \nu$. Consider the
sums

\[
LS_P(f) = \sum_{i=1}^{\nu} m_i\,\mu(S_i),
\tag{7.58}
\]

\[
US_P(f) = \sum_{i=1}^{\nu} M_i\,\mu(S_i).
\tag{7.59}
\]

We note the similarity of these sums to the ones defined in Section 6.2. As
before, we refer to $LS_P(f)$ and $US_P(f)$ as the lower and upper sums,
respectively, of $f$ with respect to the partition $P$.

The following theorem is an n-dimensional analogue of Theorem 6.2.1. 
The proof is left to the reader. 


Theorem 7.9.1. Let $f: D \to R$, where $D \subset R^n$. Suppose that $f$ is bounded
on $c_n(\mathbf{a}, \mathbf{b}) \subset D$. Then $f$ is Riemann integrable on $c_n(\mathbf{a}, \mathbf{b})$ if and only if for
every $\epsilon > 0$ there exists a partition $P$ of $c_n(\mathbf{a}, \mathbf{b})$ such that

\[
US_P(f) - LS_P(f) < \epsilon.
\]

Definition 7.9.2. Let $P_1$ and $P_2$ be two partitions of $c_n(\mathbf{a}, \mathbf{b})$. Then $P_2$ is
a refinement of $P_1$ if every point in $P_1$ is also a point in $P_2$, that is, $P_1 \subset P_2$.
□




Using this definition, it is possible to prove results similar to those of 
Lemmas 6.2.1 and 6.2.2. In particular, we have the following lemma: 


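Lemma 7.9.1. Let $f: D \to R$, where $D \subset R^n$. Suppose that $f$ is bounded
on $c_n(\mathbf{a}, \mathbf{b}) \subset D$. Then $\sup_P LS_P(f)$ and $\inf_P US_P(f)$ exist, and

\[
\sup_P LS_P(f) \le \inf_P US_P(f).
\]

Definition 7.9.3. Let $f: c_n(\mathbf{a}, \mathbf{b}) \to R$ be a bounded function. Then $f$ is
Riemann integrable on $c_n(\mathbf{a}, \mathbf{b})$ if and only if

\[
\sup_P LS_P(f) = \inf_P US_P(f).
\tag{7.60}
\]

Their common value is called the Riemann integral of $f$ on $c_n(\mathbf{a}, \mathbf{b})$ and is
denoted by $\int_{c_n(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x}$. This is equivalent to the expression
$\int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n$. For example, for $n = 2, 3$ we have

\[
\int_{c_2(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x} = \int_{a_1}^{b_1} \int_{a_2}^{b_2} f(x_1, x_2)\,dx_1\,dx_2,
\tag{7.61}
\]

\[
\int_{c_3(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x} = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \int_{a_3}^{b_3} f(x_1, x_2, x_3)\,dx_1\,dx_2\,dx_3.
\tag{7.62}
\]

The integral in formula (7.61) is called a double Riemann integral, and the
one in formula (7.62) is called a triple Riemann integral. In general, for
$n \ge 2$, $\int_{c_n(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x}$ is called an $n$-tuple Riemann integral. □

The integral $\int_{c_n(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x}$ has properties similar to those of a single-variable
Riemann integral in Section 6.4. The following theorem is an extension of
Theorem 6.3.1.

Theorem 7.9.2. If $f$ is continuous on an $n$-dimensional cell $c_n(\mathbf{a}, \mathbf{b})$, then
it is Riemann integrable there.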


7.9.2. Iterated Riemann Integrals on Cells 


The definition of the n-tuple Riemann integral in Section 7.9.1 does not 
provide a practicable way to evaluate it. We now show that the evaluation of 
this integral can be obtained by performing n Riemann integrals each of 
which is carried out with respect to one variable. Let us first consider the 
double integral as in formula (7.61). 


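Lemma 7.9.2. Suppose that $f$ is real-valued and continuous on $c_2(\mathbf{a}, \mathbf{b})$.
Define the function $g(x_2)$ as

\[
g(x_2) = \int_{a_1}^{b_1} f(x_1, x_2)\,dx_1.
\]

Then $g(x_2)$ is continuous on $[a_2, b_2]$.

Proof. Let $\epsilon > 0$ be given. Since $f$ is continuous on $c_2(\mathbf{a}, \mathbf{b})$, which is
closed and bounded, then by Theorem 7.3.2, $f$ is uniformly continuous on
$c_2(\mathbf{a}, \mathbf{b})$. We can therefore find a $\delta > 0$ such that

\[
|f(\boldsymbol{\xi}) - f(\boldsymbol{\eta})| < \frac{\epsilon}{b_1 - a_1}
\]

if $\|\boldsymbol{\xi} - \boldsymbol{\eta}\| < \delta$, where $\boldsymbol{\xi} = (x_1, x_2)'$, $\boldsymbol{\eta} = (y_1, y_2)'$, and $x_1, y_1 \in [a_1, b_1]$,
$x_2, y_2 \in [a_2, b_2]$. It follows that if $|y_2 - x_2| < \delta$, then

\[
\begin{aligned}
|g(y_2) - g(x_2)| &= \left| \int_{a_1}^{b_1} [f(x_1, y_2) - f(x_1, x_2)]\,dx_1 \right| \\
&\le \int_{a_1}^{b_1} |f(x_1, y_2) - f(x_1, x_2)|\,dx_1 \\
&< \int_{a_1}^{b_1} \frac{\epsilon}{b_1 - a_1}\,dx_1,
\end{aligned}
\tag{7.63}
\]

since $\|(x_1, y_2)' - (x_1, x_2)'\| = |y_2 - x_2| < \delta$. From inequality (7.63) we conclude
that

\[
|g(y_2) - g(x_2)| < \epsilon
\]

if $|y_2 - x_2| < \delta$. Hence, $g(x_2)$ is continuous on $[a_2, b_2]$. Consequently, from
Theorem 6.3.1, $g(x_2)$ is Riemann integrable on $[a_2, b_2]$, that is, $\int_{a_2}^{b_2} g(x_2)\,dx_2$
exists. We call the integral

\[
\int_{a_2}^{b_2} g(x_2)\,dx_2 = \int_{a_2}^{b_2} \left[ \int_{a_1}^{b_1} f(x_1, x_2)\,dx_1 \right] dx_2
\tag{7.64}
\]

an iterated integral of order 2. □

The next theorem states that the iterated integral (7.64) is equal to the
double integral $\int_{c_2(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x}$.

Theorem 7.9.3. If $f$ is continuous on $c_2(\mathbf{a}, \mathbf{b})$, then

\[
\int_{c_2(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x} = \int_{a_2}^{b_2} \left[ \int_{a_1}^{b_1} f(x_1, x_2)\,dx_1 \right] dx_2.
\]

Proof. Exercise 7.22. □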


We note that the iterated integral in (7.64) was obtained by integrating
first with respect to $x_1$, then with respect to $x_2$. This order of integration
could have been reversed, that is, we could have integrated $f$ with respect to
$x_2$ and then with respect to $x_1$. The result would be the same in both cases.
This is based on the following theorem due to Guido Fubini (1879-1943).

Theorem 7.9.4 (Fubini's Theorem). If $f$ is continuous on $c_2(\mathbf{a}, \mathbf{b})$, then

\[
\int_{c_2(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x}
= \int_{a_2}^{b_2} \left[ \int_{a_1}^{b_1} f(x_1, x_2)\,dx_1 \right] dx_2
= \int_{a_1}^{b_1} \left[ \int_{a_2}^{b_2} f(x_1, x_2)\,dx_2 \right] dx_1.
\]

Proof. See Corwin and Szczarba (1982, page 287). □

A generalization of this theorem to multiple integrals of order $n$ is given
by the next theorem [see Corwin and Szczarba (1982, Section 11.1)].

Theorem 7.9.5 (Generalized Fubini's Theorem). If $f$ is continuous on
$c_n(\mathbf{a}, \mathbf{b}) = \{\mathbf{x} \mid a_i \le x_i \le b_i,\ i = 1, 2, \ldots, n\}$, then

\[
\int_{c_n(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x} = \int_{a_i}^{b_i} \left[ \int_{c_{n-1}^{(i)}(\mathbf{a}, \mathbf{b})} f(\mathbf{x})\,d\mathbf{x}_{(i)} \right] dx_i, \qquad i = 1, 2, \ldots, n,
\]

where $d\mathbf{x}_{(i)} = dx_1\,dx_2 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_n$ and $c_{n-1}^{(i)}(\mathbf{a}, \mathbf{b})$ is an $(n - 1)$-dimensional
cell such that $a_1 \le x_1 \le b_1$, $a_2 \le x_2 \le b_2, \ldots$, $a_{i-1} \le x_{i-1} \le b_{i-1}$,
$a_{i+1} \le x_{i+1} \le b_{i+1}, \ldots$, $a_n \le x_n \le b_n$.

7.9.3. Integration over General Sets 


We now consider n-tuple Riemann integration over regions in R” that are 
not necessarily cell shaped as in Section 7.9.1. 

Let $f: D \to R$ be a bounded and continuous function, where $D$ is a
bounded region in $R^n$. There exists an $n$-dimensional cell $c_n(\mathbf{a}, \mathbf{b})$ such that
$D \subset c_n(\mathbf{a}, \mathbf{b})$. Let $g: c_n(\mathbf{a}, \mathbf{b}) \to R$ be defined as

\[
g(\mathbf{x}) = \begin{cases} f(\mathbf{x}), & \mathbf{x} \in D, \\ 0, & \mathbf{x} \notin D. \end{cases}
\]

Then

\[
\int_{c_n(\mathbf{a}, \mathbf{b})} g(\mathbf{x})\,d\mathbf{x} = \int_D f(\mathbf{x})\,d\mathbf{x}.
\tag{7.65}
\]

The integral on the right-hand side of (7.65) is independent of the choice of
$c_n(\mathbf{a}, \mathbf{b})$ provided that it contains $D$. It should be noted that the function $g(\mathbf{x})$
may not be continuous on $\mathrm{Br}(D)$, the boundary of $D$. This, however, should
not affect the existence of the integral on the left-hand side of (7.65). The
reason for this is given in Theorem 7.9.7. First, we need to define the
so-called Jordan content of a set.
Definition 7.9.4. Let $D \subset R^n$ be a bounded set such that $D \subset c_n(\mathbf{a}, \mathbf{b})$ for
some $n$-dimensional cell. Let the function $\lambda_D: R^n \to R$ be defined as

\[
\lambda_D(\mathbf{x}) = \begin{cases} 1, & \mathbf{x} \in D, \\ 0, & \mathbf{x} \notin D. \end{cases}
\]

This is called the characteristic function of $D$. Suppose that

\[
\sup_P LS_P(\lambda_D) = \inf_P US_P(\lambda_D),
\tag{7.66}
\]

where $LS_P(\lambda_D)$ and $US_P(\lambda_D)$ are, respectively, the lower and upper sums of
$\lambda_D(\mathbf{x})$ with respect to a partition $P$ of $c_n(\mathbf{a}, \mathbf{b})$. Then, $D$ is said to have an
$n$-dimensional Jordan content denoted by $\mu_j(D)$, where $\mu_j(D)$ is equal to the
common value of the terms in equality (7.66). In this case, $D$ is said to be
Jordan measurable. □


The proofs of the next two theorems can be found in Sagan (1974, Chapter 
11). 


Theorem 7.9.6. A bounded set $D \subset R^n$ is Jordan measurable if and only
if its boundary $\mathrm{Br}(D)$ has a Jordan content equal to zero.

Theorem 7.9.7. Let $f: D \to R$, where $D \subset R^n$ is bounded and Jordan
measurable. If $f$ is bounded and continuous in $D$ except on a set that has a
Jordan content equal to zero, then $\int_D f(\mathbf{x})\,d\mathbf{x}$ exists.

It follows from Theorems 7.9.6 and 7.9.7 that the integral in equality (7.65)
must exist even though $g(\mathbf{x})$ may not be continuous on the boundary $\mathrm{Br}(D)$
of $D$, since $\mathrm{Br}(D)$ has a Jordan content equal to zero.


EXAMPLE 7.9.1. Let $f(x_1, x_2) = x_1x_2$ and $D$ be the region

\[
D = \{(x_1, x_2) \mid x_1^2 + x_2^2 \le 1,\ x_1 \ge 0,\ x_2 \ge 0\}.
\]

It is easy to see that $D$ is contained inside the two-dimensional cell

\[
c_2(\mathbf{0}, \mathbf{1}) = \{(x_1, x_2) \mid 0 \le x_1 \le 1,\ 0 \le x_2 \le 1\}.
\]

Then

\[
\iint_D x_1x_2\,dx_1\,dx_2 = \int_0^1 \left[ \int_0^{(1 - x_1^2)^{1/2}} x_1x_2\,dx_2 \right] dx_1.
\]

We note that for a fixed $x_1$ in $[0, 1]$, the part of the line through $(x_1, 0)$ that
lies inside $D$ and is parallel to the $x_2$-axis is in fact the interval
$0 \le x_2 \le (1 - x_1^2)^{1/2}$. For this reason, the limits of $x_2$ are 0 and $(1 - x_1^2)^{1/2}$. Consequently,

\[
\iint_D x_1x_2\,dx_1\,dx_2 = \int_0^1 x_1 \left[ \int_0^{(1 - x_1^2)^{1/2}} x_2\,dx_2 \right] dx_1
= \frac{1}{2} \int_0^1 x_1(1 - x_1^2)\,dx_1
= \frac{1}{8}.
\]

In practice, it is not always necessary to make reference to $c_n(\mathbf{a}, \mathbf{b})$ that
encloses $D$ in order to evaluate the integral on $D$. Rather, we only need to
recognize that the limits of integration in the iterated Riemann integral
depend in general on variables that have not yet been integrated out, as was
seen in Example 7.9.1. Care should therefore be exercised in correctly
identifying the limits of integration. By changing the order of integration
(according to Fubini's theorem), it is possible to facilitate the evaluation of
the integral.


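EXAMPLE 7.9.2. Consider $\int\!\!\int_D e^{x_2^2}\, dx_1\, dx_2$, where D is the region in the
first quadrant bounded by $x_2 = 1$ and $x_1 = x_2$. In this example, it is easier to
integrate first with respect to $x_1$ and then with respect to $x_2$. Thus

$$\int\!\!\int_D e^{x_2^2}\, dx_1\, dx_2 = \int_0^1 \left[ \int_0^{x_2} e^{x_2^2}\, dx_1 \right] dx_2
= \int_0^1 x_2 e^{x_2^2}\, dx_2 = \frac{1}{2}(e - 1).$$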


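EXAMPLE 7.9.3. Consider the integral $\int\!\!\int_D (x_1^2 + x_2^3)\, dx_1\, dx_2$, where D is a
region in the first quadrant bounded by $x_2 = x_1^2$ and $x_1 = x_2^4$. Hence,

$$\int\!\!\int_D (x_1^2 + x_2^3)\, dx_1\, dx_2 = \int_0^1 \left[ \int_{x_2^4}^{\sqrt{x_2}} (x_1^2 + x_2^3)\, dx_1 \right] dx_2$$

$$= \int_0^1 \left[ \frac{1}{3}\left( x_2^{3/2} - x_2^{12} \right) + \left( \sqrt{x_2} - x_2^4 \right) x_2^3 \right] dx_2
= \frac{959}{4680}.$$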
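The value 959/4680 in Example 7.9.3 can be confirmed with exact rational arithmetic. This is a small sketch under the assumption that $x_1$ is integrated first between $x_2^4$ and $\sqrt{x_2}$, so that the remaining integrand in $x_2$ is $\frac{1}{3}(x_2^{3/2} - x_2^{12}) + x_2^{7/2} - x_2^{7}$; the helper `power_integral` is a hypothetical name introduced here.

```python
from fractions import Fraction


def power_integral(exponent):
    """Exact value of int_0^1 t**exponent dt for a rational exponent > -1."""
    return 1 / (Fraction(exponent) + 1)


# Integrate (1/3)(x2^(3/2) - x2^12) + x2^(7/2) - x2^7 over [0, 1] term by term.
third = Fraction(1, 3)
value = (third * (power_integral(Fraction(3, 2)) - power_integral(12))
         + power_integral(Fraction(7, 2)) - power_integral(7))
print(value)  # 959/4680
```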


7.9.4. Change of Variables in n-Tuple Riemann Integrals 


In this section we give an extension of the change of variables formula in 
Section 6.4.1 to n-tuple Riemann integrals. 
Theorem 7.9.8. Suppose that D is a closed and bounded set in $R^n$. Let
$f: D \to R$ be continuous. Suppose that $\mathbf{h}: D \to R^n$ is a one-to-one function
with continuous first-order partial derivatives such that the Jacobian determinant,

$$\det[J_h(\mathbf{x})] = \frac{\partial(h_1, h_2, \ldots, h_n)}{\partial(x_1, x_2, \ldots, x_n)},$$

is different from zero for all x in D, where $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ and $h_i$ is the
ith element of h (i = 1, 2, ..., n). Then

$$\int_D f(\mathbf{x})\, d\mathbf{x} = \int_{D'} f[\mathbf{g}(\mathbf{u})]\, |\det[J_g(\mathbf{u})]|\, d\mathbf{u}, \qquad (7.67)$$

where $D' = \mathbf{h}(D)$, $\mathbf{u} = \mathbf{h}(\mathbf{x})$, g is the inverse function of h, and

$$\det[J_g(\mathbf{u})] = \frac{\partial(g_1, g_2, \ldots, g_n)}{\partial(u_1, u_2, \ldots, u_n)}, \qquad (7.68)$$

where $g_i$ and $u_i$ are, respectively, the ith elements of g and u (i = 1, 2, ..., n).


Proof. See, for example, Corwin and Szczarba (1982, Theorem 6.2), or
Sagan (1974, Theorem 115.1). □


EXAMPLE 7.9.4. Consider the integral f {,)x,xjdx,dx,, where D is 
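bounded by the four parabolas, $x_2^2 = x_1$, $x_2^2 = 3x_1$, $x_1^2 = x_2$, $x_1^2 = 4x_2$. Let
$u_1 = x_2^2 / x_1$, $u_2 = x_1^2 / x_2$. The inverse transformation is given by

$$x_1 = (u_1 u_2^2)^{1/3}, \qquad x_2 = (u_1^2 u_2)^{1/3}.$$

From formula (7.68) we have

$$\frac{\partial(g_1, g_2)}{\partial(u_1, u_2)} = \frac{\partial(x_1, x_2)}{\partial(u_1, u_2)} = -\frac{1}{3}.$$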


By applying formula (7.67) we obtain 


J fv dx, dx, = af fo mie? du, duz, 


where $D'$ is a rectangular region in the $u_1 u_2$ space bounded by the lines
$u_1 = 1, 3$; $u_2 = 1, 4$. Hence,


3 4 
J [pv dx, dx, = af us? duf u$’ du, 


= £03 -1)(4 - 1). 
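The transformation used in Example 7.9.4 can be sanity-checked numerically. The sketch below is an illustration only: the function names `h`, `g`, and `jacobian_det_g`, the test point, and the finite-difference step are assumptions made here. It checks that $u_1 = x_2^2/x_1$, $u_2 = x_1^2/x_2$ is inverted by $x_1 = (u_1 u_2^2)^{1/3}$, $x_2 = (u_1^2 u_2)^{1/3}$, and that the Jacobian determinant of the inverse is $-1/3$, so that the factor $|\det J_g(\mathbf{u})|$ in formula (7.67) equals $1/3$.

```python
def h(x1, x2):
    """Forward map (x1, x2) -> (u1, u2) = (x2^2 / x1, x1^2 / x2)."""
    return x2 * x2 / x1, x1 * x1 / x2


def g(u1, u2):
    """Inverse map (u1, u2) -> (x1, x2)."""
    return (u1 * u2 * u2) ** (1.0 / 3.0), (u1 * u1 * u2) ** (1.0 / 3.0)


def jacobian_det_g(u1, u2, eps=1e-6):
    """Central-difference estimate of d(x1, x2) / d(u1, u2)."""
    a_p, c_p = g(u1 + eps, u2)
    a_m, c_m = g(u1 - eps, u2)
    b_p, d_p = g(u1, u2 + eps)
    b_m, d_m = g(u1, u2 - eps)
    dx1_du1 = (a_p - a_m) / (2 * eps)
    dx2_du1 = (c_p - c_m) / (2 * eps)
    dx1_du2 = (b_p - b_m) / (2 * eps)
    dx2_du2 = (d_p - d_m) / (2 * eps)
    return dx1_du1 * dx2_du2 - dx1_du2 * dx2_du1


u = (2.0, 3.0)          # a point inside the rectangle 1 <= u1 <= 3, 1 <= u2 <= 4
x = g(*u)
round_trip = h(*x)      # should reproduce u
det = jacobian_det_g(*u)
print(round_trip, det)  # det should be close to -1/3
```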
7.10. DIFFERENTIATION UNDER THE INTEGRAL SIGN


Suppose that $f(x_1, x_2, \ldots, x_n)$ is a real-valued function defined on $D \subset R^n$. If
some of the $x_i$'s, for example, $x_{m+1}, x_{m+2}, \ldots, x_n$ (n > m), are integrated out,
we obtain a function that depends only on the remaining variables. In this
section we discuss conditions under which the latter function is differentiable.
For simplicity, we shall only consider functions of n = 2 variables.


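Theorem 7.10.1. Let $f: D \to R$, where $D \subset R^2$ contains the two-dimensional
cell $c_2(\mathbf{a}, \mathbf{b}) = \{(x_1, x_2) \mid a_1 \le x_1 \le b_1,\ a_2 \le x_2 \le b_2\}$. Suppose that f is
continuous and has a continuous first-order partial derivative with respect to
$x_2$ in D. Then, for $a_2 < x_2 < b_2$,

$$\frac{d}{dx_2} \int_{a_1}^{b_1} f(x_1, x_2)\, dx_1 = \int_{a_1}^{b_1} \frac{\partial f(x_1, x_2)}{\partial x_2}\, dx_1. \qquad (7.69)$$

Proof. Let $h(x_2)$ be defined on $[a_2, b_2]$ as

$$h(x_2) = \int_{a_1}^{b_1} \frac{\partial f(x_1, x_2)}{\partial x_2}\, dx_1, \qquad a_2 \le x_2 \le b_2.$$

Since $\partial f / \partial x_2$ is continuous, then by Lemma 7.9.2, $h(x_2)$ is continuous on
$[a_2, b_2]$. Now, let t be such that $a_2 < t < b_2$. By integrating $h(x_2)$ over the
interval $[a_2, t]$ we obtain

$$\int_{a_2}^{t} h(x_2)\, dx_2 = \int_{a_2}^{t} \left[ \int_{a_1}^{b_1} \frac{\partial f(x_1, x_2)}{\partial x_2}\, dx_1 \right] dx_2. \qquad (7.70)$$

The order of integration in (7.70) can be reversed by Theorem 7.9.4. We then
have

$$\int_{a_2}^{t} h(x_2)\, dx_2 = \int_{a_1}^{b_1} \left[ \int_{a_2}^{t} \frac{\partial f(x_1, x_2)}{\partial x_2}\, dx_2 \right] dx_1$$

$$= \int_{a_1}^{b_1} [f(x_1, t) - f(x_1, a_2)]\, dx_1$$

$$= \int_{a_1}^{b_1} f(x_1, t)\, dx_1 - \int_{a_1}^{b_1} f(x_1, a_2)\, dx_1$$

$$= F(t) - F(a_2), \qquad (7.71)$$

where $F(y) = \int_{a_1}^{b_1} f(x_1, y)\, dx_1$. If we now apply Theorem 6.4.8 and differentiate
the two sides of (7.71) with respect to t we obtain $h(t) = F'(t)$, that is,

$$\int_{a_1}^{b_1} \frac{\partial f(x_1, t)}{\partial t}\, dx_1 = \frac{d}{dt} \int_{a_1}^{b_1} f(x_1, t)\, dx_1. \qquad (7.72)$$

Formula (7.69) now follows from formula (7.72) on replacing t with $x_2$. □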


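Theorem 7.10.2. Let f and D be the same as in Theorem 7.10.1.
Furthermore, let $\lambda(x_2)$ and $\theta(x_2)$ be functions defined and having continuous
derivatives on $[a_2, b_2]$ such that $a_1 \le \lambda(x_2) \le \theta(x_2) \le b_1$ for all $x_2$ in
$[a_2, b_2]$. Then the function $G: [a_2, b_2] \to R$ defined by

$$G(x_2) = \int_{\lambda(x_2)}^{\theta(x_2)} f(x_1, x_2)\, dx_1$$

is differentiable for $a_2 < x_2 < b_2$, and

$$\frac{dG}{dx_2} = \int_{\lambda(x_2)}^{\theta(x_2)} \frac{\partial f(x_1, x_2)}{\partial x_2}\, dx_1 + \theta'(x_2) f[\theta(x_2), x_2] - \lambda'(x_2) f[\lambda(x_2), x_2].$$

Proof. Let us write $G(x_2)$ as $H(\lambda, \theta, x_2)$. Since both $\lambda$ and $\theta$ depend on
$x_2$, then by applying the total derivative formula [see formula (7.12)] to H we
obtain

$$\frac{dH}{dx_2} = \frac{\partial H}{\partial \lambda} \frac{d\lambda}{dx_2} + \frac{\partial H}{\partial \theta} \frac{d\theta}{dx_2} + \frac{\partial H}{\partial x_2}. \qquad (7.73)$$

Now, by Theorem 6.4.8,

$$\frac{\partial H}{\partial \theta} = \frac{\partial}{\partial \theta} \int_{\lambda}^{\theta} f(x_1, x_2)\, dx_1 = f(\theta, x_2),$$

$$\frac{\partial H}{\partial \lambda} = \frac{\partial}{\partial \lambda} \int_{\lambda}^{\theta} f(x_1, x_2)\, dx_1
= -\frac{\partial}{\partial \lambda} \int_{\theta}^{\lambda} f(x_1, x_2)\, dx_1 = -f(\lambda, x_2).$$

Furthermore, by Theorem 7.10.1,

$$\frac{\partial H}{\partial x_2} = \frac{\partial}{\partial x_2} \int_{\lambda}^{\theta} f(x_1, x_2)\, dx_1 = \int_{\lambda}^{\theta} \frac{\partial f(x_1, x_2)}{\partial x_2}\, dx_1.$$

By making the proper substitution in formula (7.73) we finally conclude that

$$\frac{d}{dx_2} \int_{\lambda(x_2)}^{\theta(x_2)} f(x_1, x_2)\, dx_1 = \int_{\lambda(x_2)}^{\theta(x_2)} \frac{\partial f(x_1, x_2)}{\partial x_2}\, dx_1 + \theta'(x_2) f[\theta(x_2), x_2] - \lambda'(x_2) f[\lambda(x_2), x_2]. \qquad □$$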
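EXAMPLE 7.10.1.

$$\frac{d}{dx_2} \int_{\cos x_2}^{x_2^2} (x_1 x_2^2 - 1) e^{-x_1}\, dx_1 = \int_{\cos x_2}^{x_2^2} 2 x_1 x_2 e^{-x_1}\, dx_1$$

$$\qquad + \sin x_2\, (x_2^2 \cos x_2 - 1) e^{-\cos x_2} + 2 x_2 (x_2^4 - 1) e^{-x_2^2}.$$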

Theorems 7.10.1 and 7.10.2 can be used to evaluate certain integrals of the
form $\int_a^b f(x)\, dx$. For example, consider the integral

$$I = \int_0^{\pi} x^2 \cos x\, dx.$$

Define the function

$$F(x_2) = \int_0^{\pi} \cos(x_1 x_2)\, dx_1,$$

where $x_2 \ge 1$. Then

$$F(x_2) = \frac{1}{x_2} \sin(x_1 x_2) \Big|_{x_1 = 0}^{x_1 = \pi} = \frac{1}{x_2} \sin(\pi x_2).$$

If we now differentiate $F(x_2)$ two times, we obtain

$$F''(x_2) = \frac{2 \sin(\pi x_2) - 2 \pi x_2 \cos(\pi x_2) - \pi^2 x_2^2 \sin(\pi x_2)}{x_2^3}.$$

Thus

$$\int_0^{\pi} x_1^2 \cos(x_1 x_2)\, dx_1 = \frac{2 \pi x_2 \cos(\pi x_2) + \pi^2 x_2^2 \sin(\pi x_2) - 2 \sin(\pi x_2)}{x_2^3}.$$

By replacing $x_2$ with 1 we obtain

$$I = \int_0^{\pi} x_1^2 \cos x_1\, dx_1 = -2\pi.$$
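As a check on this device, the closed form obtained by differentiating $F(x_2) = \sin(\pi x_2)/x_2$ twice can be compared with direct numerical integration. The sketch below is only an illustration; the Simpson helper and the second test value $x_2 = 1.7$ are assumptions made here.

```python
import math


def simpson(func, a, b, n=2000):
    """Composite Simpson rule on [a, b] with an even number n of panels."""
    h = (b - a) / n
    s = func(a) + func(b)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * func(a + k * h)
    return s * h / 3.0


def closed_form(x2):
    """-F''(x2), i.e. the value of int_0^pi x1^2 cos(x1 x2) dx1."""
    s, c = math.sin(math.pi * x2), math.cos(math.pi * x2)
    return (2 * math.pi * x2 * c + math.pi ** 2 * x2 ** 2 * s - 2 * s) / x2 ** 3


def numeric(x2):
    return simpson(lambda x1: x1 ** 2 * math.cos(x1 * x2), 0.0, math.pi)


for x2 in (1.0, 1.7):
    print(x2, numeric(x2), closed_form(x2))
# At x2 = 1 both values should be close to -2*pi.
```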
0 


7.11. APPLICATIONS IN STATISTICS 


Multidimensional calculus provides a theoretical framework for the study of 
multivariate distributions, that is, joint distributions of several random vari- 
ables. It can also be used to estimate the parameters of a statistical model. 
We now provide details of some of these applications. 

Let $\mathbf{X} = (X_1, X_2, \ldots, X_n)'$ be a random vector. The distribution of X is
characterized by its cumulative distribution function, namely,

$$F(\mathbf{x}) = P(X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n), \qquad (7.74)$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$. If $F(\mathbf{x})$ is continuous and has an nth-order mixed
partial derivative with respect to $x_1, x_2, \ldots, x_n$, then the function

$$f(\mathbf{x}) = \frac{\partial^n F(\mathbf{x})}{\partial x_1\, \partial x_2 \cdots \partial x_n}$$

is called the density function of X. In this case, formula (7.74) can be written
in the form

$$F(\mathbf{x}) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} \cdots \int_{-\infty}^{x_n} f(\mathbf{z})\, d\mathbf{z},$$

where $\mathbf{z} = (z_1, z_2, \ldots, z_n)'$. If the random variable $X_i$ (i = 1, 2, ..., n) is
considered separately, then its distribution function is called the ith marginal
distribution of X. Its density function $f_i(x_i)$, called the ith marginal density
function, can be obtained by integrating out the remaining n − 1 variables
from $f(\mathbf{x})$. For example, if $\mathbf{X} = (X_1, X_2)'$, then the marginal density function
of $X_1$ is

$$f_1(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_2.$$

Similarly, the marginal density function of $X_2$ is

$$f_2(x_2) = \int_{-\infty}^{\infty} f(x_1, x_2)\, dx_1.$$

In particular, if $X_1, X_2, \ldots, X_n$ are independent random variables, then the
density function of $\mathbf{X} = (X_1, X_2, \ldots, X_n)'$ is the product of all the associated
marginal density functions, that is, $f(\mathbf{x}) = \prod_{i=1}^{n} f_i(x_i)$.

If only n − 2 variables are integrated out from $f(\mathbf{x})$, we obtain the
so-called bivariate density function of the remaining two variables. For
example, if $\mathbf{X} = (X_1, X_2, X_3, X_4)'$, the bivariate density function of $X_1$ and
$X_2$ is

$$f_{12}(x_1, x_2) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_1, x_2, x_3, x_4)\, dx_3\, dx_4.$$

Now, the mean of $\mathbf{X} = (X_1, X_2, \ldots, X_n)'$ is $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_n)'$, where

$$\mu_i = \int_{-\infty}^{\infty} x_i f_i(x_i)\, dx_i, \qquad i = 1, 2, \ldots, n.$$

The variance-covariance matrix of X is the n × n matrix $\Sigma = (\sigma_{ij})$, where

$$\sigma_{ij} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x_i - \mu_i)(x_j - \mu_j) f_{ij}(x_i, x_j)\, dx_i\, dx_j,$$

where $\mu_i$ and $\mu_j$ are the means of $X_i$ and $X_j$, respectively, and $f_{ij}(x_i, x_j)$ is
the bivariate density function of $X_i$ and $X_j$, $i \ne j$. If i = j, then $\sigma_{ii}$ is the
variance of $X_i$, where

$$\sigma_{ii} = \int_{-\infty}^{\infty} (x_i - \mu_i)^2 f_i(x_i)\, dx_i, \qquad i = 1, 2, \ldots, n.$$


7.11.1. Transformations of Random Vectors 


In this section we consider a multivariate extension of formula (6.73) regard- 
ing the density function of a function of a single random variable. This is 
given in the next theorem. 


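Theorem 7.11.1. Let X be a random vector with a continuous density
function $f(\mathbf{x})$. Let $\mathbf{g}: D \to R^n$, where D is an open subset of $R^n$ such that
$P(\mathbf{X} \in D) = 1$. Suppose that g satisfies the conditions of the inverse function
theorem (Theorem 7.6.1), namely the following:

i. g has continuous first-order partial derivatives in D.
ii. The Jacobian matrix $J_g(\mathbf{x})$ is nonsingular in D, that is,

$$\det[J_g(\mathbf{x})] = \frac{\partial(g_1, g_2, \ldots, g_n)}{\partial(x_1, x_2, \ldots, x_n)} \ne 0$$

for all $\mathbf{x} \in D$, where $g_i$ is the ith element of g (i = 1, 2, ..., n).

Then the density function of $\mathbf{Y} = \mathbf{g}(\mathbf{X})$ is given by

$$h(\mathbf{y}) = f[\mathbf{g}^{-1}(\mathbf{y})]\, |\det[J_{g^{-1}}(\mathbf{y})]|,$$

where $\mathbf{g}^{-1}$ is the inverse function of g.


Proof. By Theorem 7.6.1, the inverse function of g exists. Let us therefore
write $\mathbf{X} = \mathbf{g}^{-1}(\mathbf{Y})$. Now, the cumulative distribution function of Y is

$$H(\mathbf{y}) = P[g_1(\mathbf{X}) \le y_1, g_2(\mathbf{X}) \le y_2, \ldots, g_n(\mathbf{X}) \le y_n]
= \int_{A_n} f(\mathbf{x})\, d\mathbf{x}, \qquad (7.75)$$

where $A_n = \{\mathbf{x} \in D \mid g_i(\mathbf{x}) \le y_i,\ i = 1, 2, \ldots, n\}$. If we make the change of
variable $\mathbf{w} = \mathbf{g}(\mathbf{x})$ in formula (7.75), then, by applying Theorem 7.9.8 with
$\mathbf{g}^{-1}(\mathbf{w})$ used instead of $\mathbf{g}(\mathbf{u})$, we obtain

$$\int_{A_n} f(\mathbf{x})\, d\mathbf{x} = \int_{B_n} f[\mathbf{g}^{-1}(\mathbf{w})]\, |\det[J_{g^{-1}}(\mathbf{w})]|\, d\mathbf{w},$$

where $B_n = \mathbf{g}(A_n) = \{\mathbf{g}(\mathbf{x}) \mid g_i(\mathbf{x}) \le y_i,\ i = 1, 2, \ldots, n\}$. Thus

$$H(\mathbf{y}) = \int_{-\infty}^{y_1} \int_{-\infty}^{y_2} \cdots \int_{-\infty}^{y_n} f[\mathbf{g}^{-1}(\mathbf{w})]\, |\det[J_{g^{-1}}(\mathbf{w})]|\, d\mathbf{w}.$$

It follows that the density function of Y is

$$h(\mathbf{y}) = f[\mathbf{g}^{-1}(\mathbf{y})] \left| \frac{\partial(g_1^{-1}, g_2^{-1}, \ldots, g_n^{-1})}{\partial(y_1, y_2, \ldots, y_n)} \right|, \qquad (7.76)$$

where $g_i^{-1}$ is the ith element of $\mathbf{g}^{-1}$ (i = 1, 2, ..., n). □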


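EXAMPLE 7.11.1. Let $\mathbf{X} = (X_1, X_2)'$, where $X_1$ and $X_2$ are independent
random variables that have the standard normal distribution. Here, the
density function of X is the product of the density functions of $X_1$ and $X_2$.
Thus

$$f(\mathbf{x}) = \frac{1}{2\pi} \exp\left[ -\frac{1}{2}(x_1^2 + x_2^2) \right], \qquad -\infty < x_1, x_2 < \infty.$$

Let $\mathbf{Y} = (Y_1, Y_2)'$ be defined as

$$Y_1 = X_1 + X_2, \qquad Y_2 = X_1 - 2X_2.$$

In this case, the set D in Theorem 7.11.1 is $R^2$, $g_1(\mathbf{x}) = x_1 + x_2$, $g_2(\mathbf{x}) = x_1 - 2x_2$,
$g_1^{-1}(\mathbf{y}) = x_1 = \frac{1}{3}(2y_1 + y_2)$, $g_2^{-1}(\mathbf{y}) = x_2 = \frac{1}{3}(y_1 - y_2)$, and

$$\frac{\partial(g_1^{-1}, g_2^{-1})}{\partial(y_1, y_2)} = \det \begin{bmatrix} \frac{2}{3} & \frac{1}{3} \\ \frac{1}{3} & -\frac{1}{3} \end{bmatrix} = -\frac{1}{3}.$$

Hence, by formula (7.76), the density function of Y is

$$h(\mathbf{y}) = \frac{1}{2\pi} \exp\left\{ -\frac{1}{2} \left[ \left( \frac{2y_1 + y_2}{3} \right)^2 + \left( \frac{y_1 - y_2}{3} \right)^2 \right] \right\} \times \frac{1}{3}$$

$$= \frac{1}{6\pi} \exp\left[ -\frac{1}{18}\left( 5y_1^2 + 2y_1 y_2 + 2y_2^2 \right) \right], \qquad -\infty < y_1, y_2 < \infty.$$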
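The algebra in Example 7.11.1 can be verified pointwise. The following sketch assumes the transformation $Y_1 = X_1 + X_2$, $Y_2 = X_1 - 2X_2$ applied to independent standard normal variables; the function names and the test points are illustrative choices. It evaluates the density from Theorem 7.11.1 as $f[\mathbf{g}^{-1}(\mathbf{y})]$ times the absolute Jacobian determinant and compares it with the simplified closed form.

```python
import math


def f(x1, x2):
    """Joint density of two independent standard normal variables."""
    return math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2.0 * math.pi)


def g_inverse(y1, y2):
    """Inverse of y1 = x1 + x2, y2 = x1 - 2*x2."""
    return (2.0 * y1 + y2) / 3.0, (y1 - y2) / 3.0


ABS_DET_JACOBIAN = 1.0 / 3.0  # |d(x1, x2) / d(y1, y2)|


def h_from_theorem(y1, y2):
    """Density of Y computed directly from formula (7.76)."""
    x1, x2 = g_inverse(y1, y2)
    return f(x1, x2) * ABS_DET_JACOBIAN


def h_closed_form(y1, y2):
    """Simplified density exp[-(5 y1^2 + 2 y1 y2 + 2 y2^2)/18] / (6 pi)."""
    q = 5.0 * y1 * y1 + 2.0 * y1 * y2 + 2.0 * y2 * y2
    return math.exp(-q / 18.0) / (6.0 * math.pi)


points = [(0.0, 0.0), (1.0, -2.0), (-0.5, 3.0)]
for y in points:
    print(y, h_from_theorem(*y), h_closed_form(*y))
```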


EXAMPLE 7.11.2. Suppose that it is desired to determine the density
function of the random variable $V = X_1 + X_2$, where $X_1 \ge 0$, $X_2 \ge 0$, and
$\mathbf{X} = (X_1, X_2)'$ has a continuous density function $f(x_1, x_2)$. This can be
accomplished in two ways:

i. Let Q(v) denote the cumulative distribution function of V and let q(v)
be its density function. Then

$$Q(v) = P(X_1 + X_2 \le v) = \int\!\!\int_A f(x_1, x_2)\, dx_1\, dx_2,$$

where $A = \{(x_1, x_2) \mid x_1 \ge 0,\ x_2 \ge 0,\ x_1 + x_2 \le v\}$. We can write Q(v) as

$$Q(v) = \int_0^{v} \left[ \int_0^{v - x_2} f(x_1, x_2)\, dx_1 \right] dx_2.$$

If we now apply Theorem 7.10.2, we obtain

$$q(v) = \frac{dQ}{dv} = \int_0^{v} \frac{\partial}{\partial v} \left[ \int_0^{v - x_2} f(x_1, x_2)\, dx_1 \right] dx_2
= \int_0^{v} f(v - x_2, x_2)\, dx_2. \qquad (7.77)$$

ii. Consider the following transformation:

$$Y_1 = X_1 + X_2, \qquad Y_2 = X_2.$$

Then

$$X_1 = Y_1 - Y_2, \qquad X_2 = Y_2.$$

By Theorem 7.11.1, the density function of $\mathbf{Y} = (Y_1, Y_2)'$ is

$$h(y_1, y_2) = f(y_1 - y_2, y_2) \left| \frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} \right|
= f(y_1 - y_2, y_2) \left| \det \begin{bmatrix} 1 & -1 \\ 0 & 1 \end{bmatrix} \right|
= f(y_1 - y_2, y_2), \qquad y_1 \ge y_2 \ge 0.$$

By integrating $y_2$ out we obtain the marginal density function of
$Y_1 = V$, namely,

$$q(v) = \int_0^{v} f(v - y_2, y_2)\, dy_2 = \int_0^{v} f(v - x_2, x_2)\, dx_2.$$

This is identical to the density function given in formula (7.77).
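Formula (7.77) can be tried on a concrete density. The sketch below is an illustration under an assumption made here, not taken from the text: $X_1$ and $X_2$ are independent exponential variables with rates 1 and 2, so that $f(x_1, x_2) = 2e^{-x_1 - 2x_2}$ and the integral in (7.77) has the closed form $2e^{-v}(1 - e^{-v})$.

```python
import math


def f(x1, x2):
    """Assumed joint density: independent Exp(rate 1) and Exp(rate 2)."""
    if x1 < 0.0 or x2 < 0.0:
        return 0.0
    return 2.0 * math.exp(-x1 - 2.0 * x2)


def q(v, n=2000):
    """Midpoint-rule value of int_0^v f(v - x2, x2) dx2, as in formula (7.77)."""
    h = v / n
    return sum(f(v - (k + 0.5) * h, (k + 0.5) * h) for k in range(n)) * h


def q_exact(v):
    """Closed form of the same integral for the assumed density."""
    return 2.0 * math.exp(-v) * (1.0 - math.exp(-v))


for v in (0.5, 1.0, 3.0):
    print(v, q(v), q_exact(v))
```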


7.11.2. Maximum Likelihood Estimation 


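Let $X_1, X_2, \ldots, X_n$ be a sample of size n from a population whose distribution
depends on a set of p parameters, namely $\theta_1, \theta_2, \ldots, \theta_p$. We can regard
this sample as forming a random vector $\mathbf{X} = (X_1, X_2, \ldots, X_n)'$. Suppose that
X has the density function $f(\mathbf{x}, \boldsymbol{\theta})$, where $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ and
$\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_p)'$. This density function is usually referred to as the likelihood
function of X; we denote it by $L(\mathbf{x}, \boldsymbol{\theta})$.

For a given sample, the maximum likelihood estimate of $\boldsymbol{\theta}$, denoted by $\hat{\boldsymbol{\theta}}$,
is the value of $\boldsymbol{\theta}$ that maximizes $L(\mathbf{x}, \boldsymbol{\theta})$. If $L(\mathbf{x}, \boldsymbol{\theta})$ has partial derivatives with
respect to $\theta_1, \theta_2, \ldots, \theta_p$, then $\hat{\boldsymbol{\theta}}$ is often obtained by solving the equations

$$\frac{\partial L(\mathbf{x}, \hat{\boldsymbol{\theta}})}{\partial \theta_i} = 0, \qquad i = 1, 2, \ldots, p.$$

In most situations, it is more convenient to work with the natural logarithm
of $L(\mathbf{x}, \boldsymbol{\theta})$; its maxima are attained at the same points as those of $L(\mathbf{x}, \boldsymbol{\theta})$.
Thus $\hat{\boldsymbol{\theta}}$ satisfies the equation

$$\frac{\partial \log[L(\mathbf{x}, \hat{\boldsymbol{\theta}})]}{\partial \theta_i} = 0, \qquad i = 1, 2, \ldots, p. \qquad (7.78)$$

Equations (7.78) are known as the likelihood equations.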
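EXAMPLE 7.11.3. Suppose that $X_1, X_2, \ldots, X_n$ form a sample of size n
from a normal distribution with an unknown mean $\mu$ and a variance $\sigma^2$.
Here, $\boldsymbol{\theta} = (\mu, \sigma^2)'$, and the likelihood function is given by

$$L(\mathbf{x}, \boldsymbol{\theta}) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right].$$

Let $L^*(\mathbf{x}, \boldsymbol{\theta}) = \log L(\mathbf{x}, \boldsymbol{\theta})$. Then

$$L^*(\mathbf{x}, \boldsymbol{\theta}) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 - \frac{n}{2} \log(2\pi\sigma^2).$$

The likelihood equations in formula (7.78) are of the form

$$\frac{\partial L^*}{\partial \mu} = \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{n} (x_i - \hat{\mu}) = 0, \qquad (7.79)$$

$$\frac{\partial L^*}{\partial \sigma^2} = \frac{1}{2\hat{\sigma}^4} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 - \frac{n}{2\hat{\sigma}^2} = 0. \qquad (7.80)$$

Equations (7.79) and (7.80) can be written as

$$n(\bar{x} - \hat{\mu}) = 0, \qquad (7.81)$$

$$\sum_{i=1}^{n} (x_i - \hat{\mu})^2 - n\hat{\sigma}^2 = 0, \qquad (7.82)$$

where $\bar{x} = (1/n) \sum_{i=1}^{n} x_i$. If $n \ge 2$, then equations (7.81) and (7.82) have the
solution

$$\hat{\mu} = \bar{x}, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

These are the maximum likelihood estimates of $\mu$ and $\sigma^2$, respectively.

It can be verified that $\hat{\mu}$ and $\hat{\sigma}^2$ are indeed the values of $\mu$ and $\sigma^2$ that
maximize $L^*(\mathbf{x}, \boldsymbol{\theta})$. To show this, let us consider the Hessian matrix A of
second-order partial derivatives of $L^*$ (see formula 7.34),

$$A = \begin{bmatrix} \dfrac{\partial^2 L^*}{\partial \mu^2} & \dfrac{\partial^2 L^*}{\partial \mu\, \partial \sigma^2} \\[2ex] \dfrac{\partial^2 L^*}{\partial \mu\, \partial \sigma^2} & \dfrac{\partial^2 L^*}{\partial \sigma^4} \end{bmatrix}.$$

Hence, for $\mu = \hat{\mu}$ and $\sigma^2 = \hat{\sigma}^2$,

$$\frac{\partial^2 L^*}{\partial \mu^2} = -\frac{n}{\hat{\sigma}^2},$$

$$\frac{\partial^2 L^*}{\partial \mu\, \partial \sigma^2} = -\frac{1}{\hat{\sigma}^4} \sum_{i=1}^{n} (x_i - \hat{\mu}) = 0,$$

$$\frac{\partial^2 L^*}{\partial \sigma^4} = -\frac{n}{2\hat{\sigma}^4}.$$

Thus $\partial^2 L^* / \partial \mu^2 < 0$ and $\det(A) = n^2 / (2\hat{\sigma}^6) > 0$. Therefore, by Corollary 7.7.1,
$(\hat{\mu}, \hat{\sigma}^2)$ is a point of local maximum of $L^*$. Since it is the only maximum, it
must also be the absolute maximum.


Maximum likelihood estimators have interesting asymptotic properties.
For more information on these properties, see, for example, Bickel and
Doksum (1977, Section 4.4).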
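The likelihood equations of Example 7.11.3 are easy to check on data. The sketch below is only an illustration: the sample values are hypothetical, and the function names are introduced here. It computes $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = (1/n)\sum (x_i - \bar{x})^2$, evaluates the two partial derivatives of the log-likelihood at that point, and forms the Hessian entries used in the second-order check.

```python
def normal_mle(sample):
    """Maximum likelihood estimates (mu_hat, sigma2_hat) for a normal sample."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n
    return mu_hat, sigma2_hat


def score(sample, mu, sigma2):
    """Partial derivatives of the log-likelihood with respect to mu and sigma^2."""
    n = len(sample)
    s1 = sum(x - mu for x in sample)
    s2 = sum((x - mu) ** 2 for x in sample)
    return s1 / sigma2, s2 / (2.0 * sigma2 ** 2) - n / (2.0 * sigma2)


def hessian_at_mle(sample):
    """Hessian of the log-likelihood evaluated at (mu_hat, sigma2_hat)."""
    n = len(sample)
    _, sigma2_hat = normal_mle(sample)
    return [[-n / sigma2_hat, 0.0], [0.0, -n / (2.0 * sigma2_hat ** 2)]]


data = [4.1, 5.3, 3.8, 6.0, 5.2]  # hypothetical sample
mu_hat, sigma2_hat = normal_mle(data)
grad = score(data, mu_hat, sigma2_hat)
hess = hessian_at_mle(data)
print(mu_hat, sigma2_hat, grad)  # both score components should be near zero
```

Note that $\hat{\sigma}^2$ divides by n rather than n − 1, as in the solution of equations (7.81) and (7.82).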


7.11.3. Comparison of Two Unbiased Estimators 


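Let $X_1$ and $X_2$ be two unbiased estimators of a parameter $\mu$. Suppose that
$\mathbf{X} = (X_1, X_2)'$ has the density function $f(x_1, x_2)$, $-\infty < x_1, x_2 < \infty$. To compare
these estimators, we may consider the probability that one estimator, for
example, $X_1$, is closer to $\mu$ than the other, $X_2$, that is,

$$p = P[\,|X_1 - \mu| < |X_2 - \mu|\,].$$

This probability can be expressed as

$$p = \int\!\!\int_D f(x_1, x_2)\, dx_1\, dx_2, \qquad (7.83)$$

where $D = \{(x_1, x_2) \mid |x_1 - \mu| < |x_2 - \mu|\}$. Let us now make the following
change of variables using polar coordinates:

$$x_1 - \mu = r \cos\theta, \qquad x_2 - \mu = r \sin\theta.$$

By applying formula (7.67), the integral in (7.83) can be written as

$$p = \int\!\!\int_{D'} g(r, \theta) \left| \frac{\partial(x_1, x_2)}{\partial(r, \theta)} \right| dr\, d\theta
= \int\!\!\int_{D'} g(r, \theta)\, r\, dr\, d\theta,$$

where $g(r, \theta) = f(\mu + r\cos\theta, \mu + r\sin\theta)$ and

$$D' = \left\{ (r, \theta) \,\Big|\, 0 \le r < \infty,\ \frac{\pi}{4} \le \theta \le \frac{3\pi}{4} \text{ or } \frac{5\pi}{4} \le \theta \le \frac{7\pi}{4} \right\}.$$

In particular, if X has the bivariate normal density, then

$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2(1 - \rho^2)^{1/2}}
\exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x_1 - \mu)^2}{\sigma_1^2} - \frac{2\rho(x_1 - \mu)(x_2 - \mu)}{\sigma_1\sigma_2} + \frac{(x_2 - \mu)^2}{\sigma_2^2} \right] \right\},$$

$-\infty < x_1, x_2 < \infty$, and

$$g(r, \theta) = \frac{1}{2\pi\sigma_1\sigma_2(1 - \rho^2)^{1/2}}
\exp\left\{ -\frac{r^2}{2(1 - \rho^2)} \left[ \frac{\cos^2\theta}{\sigma_1^2} - \frac{2\rho\cos\theta\sin\theta}{\sigma_1\sigma_2} + \frac{\sin^2\theta}{\sigma_2^2} \right] \right\},$$

where $\sigma_1^2$ and $\sigma_2^2$ are the variances of $X_1$ and $X_2$, respectively, and $\rho$ is
their correlation coefficient. In this case,

$$p = 2 \int_{\pi/4}^{3\pi/4} \left[ \int_0^{\infty} g(r, \theta)\, r\, dr \right] d\theta.$$

It can be shown (see Lowerre, 1983) that

$$p = 1 - \frac{1}{\pi} \operatorname{Arctan}\left[ \frac{2\sigma_1\sigma_2(1 - \rho^2)^{1/2}}{\sigma_2^2 - \sigma_1^2} \right] \qquad (7.84)$$

if $\sigma_2 > \sigma_1$. A large value of p indicates that $X_1$ is closer to $\mu$ than $X_2$, which
means that $X_1$ is a better estimator of $\mu$ than $X_2$.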
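Formula (7.84) can be compared with a simulation. The sketch below is a rough Monte Carlo check under assumptions made here (the parameter values $\sigma_1 = 1$, $\sigma_2 = 2$, $\rho = 0.3$, the seed, and the sample size are arbitrary); it reads (7.84) as $p = 1 - \frac{1}{\pi}\operatorname{Arctan}[2\sigma_1\sigma_2(1 - \rho^2)^{1/2} / (\sigma_2^2 - \sigma_1^2)]$ for $\sigma_2 > \sigma_1$, and draws correlated normal pairs with a common mean $\mu$.

```python
import math
import random


def p_closed_form(sigma1, sigma2, rho):
    """Formula (7.84); requires sigma2 > sigma1."""
    num = 2.0 * sigma1 * sigma2 * math.sqrt(1.0 - rho * rho)
    return 1.0 - math.atan(num / (sigma2 ** 2 - sigma1 ** 2)) / math.pi


def p_monte_carlo(sigma1, sigma2, rho, mu=0.0, n=200_000, seed=12345):
    """Fraction of simulated pairs with |X1 - mu| < |X2 - mu|."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = mu + sigma1 * z1
        # Mixing z1 and z2 gives X2 correlation rho with X1.
        x2 = mu + sigma2 * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
        if abs(x1 - mu) < abs(x2 - mu):
            hits += 1
    return hits / n


exact = p_closed_form(1.0, 2.0, 0.3)
estimate = p_monte_carlo(1.0, 2.0, 0.3)
print(exact, estimate)  # the two values should agree to about two decimals
```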


7.11.4. Best Linear Unbiased Estimation 


Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables
with a common mean $\mu$ and a common variance $\sigma^2$. An estimator of
the form $\hat{g} = \sum_{i=1}^{n} a_i X_i$, where the $a_i$'s are constants, is said to be a linear
estimator of $\mu$. This estimator is unbiased if $E(\hat{g}) = \mu$, that is, if $\sum_{i=1}^{n} a_i = 1$,
since $E(X_i) = \mu$ for i = 1, 2, ..., n. The variance of $\hat{g}$ is given by

$$\operatorname{Var}(\hat{g}) = \sigma^2 \sum_{i=1}^{n} a_i^2.$$

The smaller the variance of $\hat{g}$, the more efficient $\hat{g}$ is as an estimator of $\mu$.
In particular, if $a_1, a_2, \ldots, a_n$ are chosen so that $\operatorname{Var}(\hat{g})$ attains a minimum
value, then $\hat{g}$ will have the smallest variance among all unbiased linear
estimators of $\mu$. In this case, $\hat{g}$ is called the best linear unbiased estimator
(BLUE) of $\mu$.

Thus to find the BLUE of $\mu$ we need to minimize the function $f = \sum_{i=1}^{n} a_i^2$
subject to the constraint $\sum_{i=1}^{n} a_i = 1$. This minimization problem can be solved
using the method of Lagrange multipliers. Let us therefore write F [see
formula (7.41)] as

$$F = \sum_{i=1}^{n} a_i^2 + \lambda \left( \sum_{i=1}^{n} a_i - 1 \right),$$

$$\frac{\partial F}{\partial a_i} = 2a_i + \lambda = 0, \qquad i = 1, 2, \ldots, n.$$

Hence, $a_i = -\lambda/2$ (i = 1, 2, ..., n). Using the constraint $\sum_{i=1}^{n} a_i = 1$, we conclude
that $\lambda = -2/n$. Thus $a_i = 1/n$, i = 1, 2, ..., n. To verify that this
solution minimizes f, we need to consider the signs of $\Delta_1, \Delta_2, \ldots, \Delta_{n-1}$,
where $\Delta_i$ is the determinant of $B_i$ (see Section 7.8). Here, $B_1$ is an (n + 1) ×
(n + 1) matrix of the form

$$B_1 = \begin{bmatrix} 2 & 0 & \cdots & 0 & 1 \\ 0 & 2 & \cdots & 0 & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 2 & 1 \\ 1 & 1 & \cdots & 1 & 0 \end{bmatrix}.$$

It follows that

$$\Delta_1 = \det(B_1) = -\frac{n 2^n}{2} < 0,$$

$$\Delta_2 = -\frac{(n - 1) 2^{n-1}}{2} < 0,$$

$$\vdots$$

$$\Delta_{n-1} = -2^2 < 0.$$

Since the number of constraints, m = 1, is odd, then by the sufficient
conditions described in Section 7.8 we must have a local minimum when
$a_i = 1/n$, i = 1, 2, ..., n. Since this is the only local minimum in $R^n$, it must be
the absolute minimum. Note that for such values of $a_1, a_2, \ldots, a_n$, $\hat{g}$ is the
sample mean $\bar{X}_n$. We conclude that the sample mean is the most efficient (in
terms of variance) unbiased linear estimator of $\mu$.
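The conclusion that equal weights minimize $\sum a_i^2$ under $\sum a_i = 1$ can be illustrated numerically. The sketch below is a simple experiment, not a proof; the value n = 6, the seed, and the way competing weight vectors are generated (equal weights plus a zero-sum perturbation) are assumptions made here.

```python
import random


def weight_sum_of_squares(a):
    """f(a) = sum of a_i^2; the estimator's variance is sigma^2 times f(a)."""
    return sum(w * w for w in a)


def random_unbiased_weights(n, rng):
    """Weights summing to 1: equal weights plus a zero-sum perturbation."""
    d = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mean_d = sum(d) / n
    return [1.0 / n + (di - mean_d) for di in d]


n = 6
rng = random.Random(0)
best = weight_sum_of_squares([1.0 / n] * n)  # equals 1/n for equal weights
others = [weight_sum_of_squares(random_unbiased_weights(n, rng))
          for _ in range(1000)]
print(best, min(others))  # no competing weight vector should fall below 1/n
```

Every competing vector satisfies the unbiasedness constraint, and each has a sum of squares of $1/n$ plus the squared length of its perturbation, which is why none can beat the sample mean.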


7.11.5. Optimal Choice of Sample Sizes in Stratified Sampling 


In stratified sampling, a finite population of N units is divided into r
subpopulations, called strata, of sizes $N_1, N_2, \ldots, N_r$. From each stratum a
random sample is drawn, and the drawn samples are obtained independently
in the different strata. Let $n_i$ be the size of the sample drawn from the ith
stratum (i = 1, 2, ..., r). Let $y_{ij}$ denote the response value obtained from the
jth unit within the ith stratum (i = 1, 2, ..., r; j = 1, 2, ..., $n_i$). The population
mean $\bar{Y}$ is

$$\bar{Y} = \frac{1}{N} \sum_{i=1}^{r} \sum_{j=1}^{N_i} y_{ij} = \frac{1}{N} \sum_{i=1}^{r} N_i \bar{Y}_i,$$

where $\bar{Y}_i$ is the true mean for the ith stratum (i = 1, 2, ..., r). A stratified
estimate of $\bar{Y}$ is $\bar{y}_{st}$ (st for stratified), where

$$\bar{y}_{st} = \frac{1}{N} \sum_{i=1}^{r} N_i \bar{y}_i,$$

in which $\bar{y}_i = (1/n_i) \sum_{j=1}^{n_i} y_{ij}$ is the mean of the sample from the ith stratum
(i = 1, 2, ..., r). If, in every stratum, $\bar{y}_i$ is unbiased for $\bar{Y}_i$, then $\bar{y}_{st}$ is an
unbiased estimator of $\bar{Y}$. The variance of $\bar{y}_{st}$ is

$$\operatorname{Var}(\bar{y}_{st}) = \frac{1}{N^2} \sum_{i=1}^{r} N_i^2 \operatorname{Var}(\bar{y}_i).$$

Since $\bar{y}_i$ is the mean of a random sample from a finite population, then its
variance is given by (see Cochran, 1963, page 22)

$$\operatorname{Var}(\bar{y}_i) = \frac{S_i^2}{n_i}(1 - f_i), \qquad i = 1, 2, \ldots, r,$$

where $f_i = n_i / N_i$, and

$$S_i^2 = \frac{1}{N_i - 1} \sum_{j=1}^{N_i} (y_{ij} - \bar{Y}_i)^2.$$

Hence,

$$\operatorname{Var}(\bar{y}_{st}) = \sum_{i=1}^{r} \frac{1}{n_i} L_i^2 S_i^2 (1 - f_i),$$

where $L_i = N_i / N$ (i = 1, 2, ..., r).

The sample sizes $n_1, n_2, \ldots, n_r$ can be chosen by the sampler in an optimal
way, the optimality criterion being the minimization of $\operatorname{Var}(\bar{y}_{st})$ for a specified
cost of taking the samples. Here, the cost is defined by the formula

$$\text{cost} = c_0 + \sum_{i=1}^{r} c_i n_i,$$

where $c_i$ is the cost per unit in the ith stratum (i = 1, 2, ..., r) and $c_0$ is the
overhead cost. Thus the optimal choice of the sample sizes is reduced to
finding the values of $n_1, n_2, \ldots, n_r$ that minimize $\sum_{i=1}^{r} (1/n_i) L_i^2 S_i^2 (1 - f_i)$
subject to the constraint

$$\sum_{i=1}^{r} c_i n_i = d - c_0, \qquad (7.85)$$

where d is a constant. Using the method of Lagrange multipliers, we write

$$F = \sum_{i=1}^{r} \frac{1}{n_i} L_i^2 S_i^2 (1 - f_i) + \lambda \left( \sum_{i=1}^{r} c_i n_i + c_0 - d \right)$$

$$= \sum_{i=1}^{r} \frac{1}{n_i} L_i^2 S_i^2 - \sum_{i=1}^{r} \frac{1}{N_i} L_i^2 S_i^2 + \lambda \left( \sum_{i=1}^{r} c_i n_i + c_0 - d \right).$$

Differentiating with respect to $n_i$ (i = 1, 2, ..., r), we obtain

$$\frac{\partial F}{\partial n_i} = -\frac{1}{n_i^2} L_i^2 S_i^2 + \lambda c_i = 0, \qquad i = 1, 2, \ldots, r.$$

Thus

$$n_i = (\lambda c_i)^{-1/2} L_i S_i, \qquad i = 1, 2, \ldots, r.$$

By substituting $n_i$ in the equality constraint (7.85) we get

$$\sqrt{\lambda} = \frac{\sum_{i=1}^{r} \sqrt{c_i}\, L_i S_i}{d - c_0}.$$

Therefore,

$$n_i = \frac{(d - c_0) N_i S_i}{\sqrt{c_i} \sum_{j=1}^{r} \sqrt{c_j}\, N_j S_j}, \qquad i = 1, 2, \ldots, r. \qquad (7.86)$$

It is easy to verify (using the sufficient conditions in Section 7.8) that the
values of $n_1, n_2, \ldots, n_r$ given by equation (7.86) minimize $\operatorname{Var}(\bar{y}_{st})$ under the
constraint of equality (7.85). We conclude that $\operatorname{Var}(\bar{y}_{st})$ is minimized when $n_i$
is proportional to $(1/\sqrt{c_i}) N_i S_i$ (i = 1, 2, ..., r). Consequently, $n_i$ must be
large if the corresponding stratum is large, if the cost of sampling per unit in
that stratum is low, or if the variability within the stratum is large.
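The allocation rule (7.86) is straightforward to compute. The sketch below is a minimal illustration that reads (7.86) as $n_i = (d - c_0) N_i S_i / (\sqrt{c_i} \sum_j \sqrt{c_j} N_j S_j)$; the function name and all stratum sizes, standard deviations, and costs are hypothetical values chosen here. The returned sample sizes are real numbers, so in practice they would still need to be rounded and capped at $N_i$, which this sketch does not attempt.

```python
import math


def optimal_allocation(N, S, c, d, c0):
    """Real-valued sample sizes n_i from formula (7.86)."""
    denom = sum(math.sqrt(cj) * Nj * Sj for Nj, Sj, cj in zip(N, S, c))
    return [(d - c0) * Ni * Si / (math.sqrt(ci) * denom)
            for Ni, Si, ci in zip(N, S, c)]


# Hypothetical strata: sizes N_i, standard deviations S_i, per-unit costs c_i.
N = [500, 300, 200]
S = [4.0, 10.0, 6.0]
c = [1.0, 4.0, 9.0]
n_opt = optimal_allocation(N, S, c, d=1000.0, c0=100.0)
spent = sum(ci * ni for ci, ni in zip(c, n_opt))
print(n_opt, spent)  # spent should equal d - c0, the budget in constraint (7.85)
```

The ratio $n_i \sqrt{c_i} / (N_i S_i)$ is the same for every stratum, which is the proportionality stated in the text.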
EXERCISES

In Mathematics


7.1. Let $f(x_1, x_2)$ be a function defined on $R^2$ as

$$f(x_1, x_2) = \begin{cases} \dfrac{|x_1|}{x_2^2} \exp\left( -\dfrac{|x_1|}{x_2^2} \right), & x_2 \ne 0, \\[1ex] 0, & x_2 = 0. \end{cases}$$

(a) Show that $f(x_1, x_2)$ has a limit equal to zero as $\mathbf{x} = (x_1, x_2)' \to \mathbf{0}$
along any straight line through the origin.
(b) Show that $f(x_1, x_2)$ does not have a limit as $\mathbf{x} \to \mathbf{0}$.


7.2. Prove Lemma 7.3.1. 
7.3. Prove Lemma 7.3.2. 
7.4. Prove Lemma 7.3.3. 


7.5. Consider the function

$$f(x_1, x_2) = \begin{cases} \dfrac{x_1 x_2}{x_1^2 + x_2^2}, & (x_1, x_2) \ne (0, 0), \\[1ex] 0, & (x_1, x_2) = (0, 0). \end{cases}$$

(a) Show that $f(x_1, x_2)$ is not continuous at the origin.
(b) Show that the partial derivatives of $f(x_1, x_2)$ with respect to $x_1$ and
$x_2$ exist at the origin.
[Note: This exercise shows that a multivariable function does not have
to be continuous at a point in order for its partial derivatives to exist at
that point.]


7.6. The function $f(x_1, x_2, \ldots, x_k)$ is said to be homogeneous of degree n in
$x_1, x_2, \ldots, x_k$ if for any nonzero scalar t,

$$f(tx_1, tx_2, \ldots, tx_k) = t^n f(x_1, x_2, \ldots, x_k)$$

for all $\mathbf{x} = (x_1, x_2, \ldots, x_k)'$ in the domain of f. Show that if
$f(x_1, x_2, \ldots, x_k)$ is homogeneous of degree n, then

$$\sum_{i=1}^{k} x_i \frac{\partial f}{\partial x_i} = nf.$$

[Note: This result is known as Euler's theorem for homogeneous
functions.]


7.7. Consider the function


| rir x x x 0 0 
2 bA 2 2 
f ( is ) t 2 ( 1 2) ( ) 


0, (x1; x2) = (0,0). 


(a) Is f continuous at the origin? Why or why not? 


(b) Show that f has a directional derivative in every direction at the 
origin. 


7.8. Let S be a surface defined by the equation $f(\mathbf{x}) = c_0$, where
$\mathbf{x} = (x_1, x_2, \ldots, x_k)'$ and $c_0$ is a constant. Let C denote a curve on S given
by the equations $x_1 = g_1(t), x_2 = g_2(t), \ldots, x_k = g_k(t)$, where
$g_1, g_2, \ldots, g_k$ are differentiable functions. Let s be the arc length of C
measured from some fixed point in such a way that s increases with t.
The curve can then be parameterized, using s instead of t, in the form
$x_1 = h_1(s), x_2 = h_2(s), \ldots, x_k = h_k(s)$. Suppose that f has partial derivatives
with respect to $x_1, x_2, \ldots, x_k$.
Show that the directional derivative of f at a point x on C in the
direction of v, where v is a unit tangent vector to C at x (in the
direction of increasing s), is equal to df/ds.


7.9. Use Taylor's expansion in a neighborhood of the origin to obtain a
second-order approximation for each of the following functions:
(a) $f(x_1, x_2) = \exp(x_1 \sin x_2)$.
(b) f(x,, x3, x3) = sin(e™ +x2 + x3). 


(c) f(x,, x2) = cos(x; x5). 


7.10. Suppose that $f(x_1, x_2)$ and $g(x_1, x_2)$ are continuously differentiable
functions in a neighborhood of a point $\mathbf{x}_0 = (x_{10}, x_{20})'$. Consider the
equation $u_1 = f(x_1, x_2)$. Suppose that $\partial f / \partial x_1 \ne 0$ at $\mathbf{x}_0$.

(a) Show that

$$\frac{\partial x_1}{\partial x_2} = -\frac{\partial f}{\partial x_2} \Big/ \frac{\partial f}{\partial x_1}$$

in a neighborhood of $\mathbf{x}_0$.

(b) Suppose that in a neighborhood of $\mathbf{x}_0$,

$$\frac{\partial(f, g)}{\partial(x_1, x_2)} = 0.$$

Show that

$$\frac{\partial g}{\partial x_1} \frac{\partial x_1}{\partial x_2} + \frac{\partial g}{\partial x_2} = 0,$$

that is, g is actually independent of $x_2$ in a neighborhood of $\mathbf{x}_0$.
(c) Deduce from (b) that there exists a function $\phi: D \to R$, where
$D \subset R$ is a neighborhood of $f(\mathbf{x}_0)$, such that

$$g(x_1, x_2) = \phi[f(x_1, x_2)]$$

throughout a neighborhood of $\mathbf{x}_0$. In this case, the functions f and
g are said to be functionally dependent.
(d) Show that if f and g are functionally dependent, then

$$\frac{\partial(f, g)}{\partial(x_1, x_2)} = 0.$$

[Note: From (b), (c), and (d) we conclude that f and g are functionally
dependent on a set $\Delta \subset R^2$ if and only if $\partial(f, g)/\partial(x_1, x_2) = 0$ in $\Delta$.]


7.11. Consider the equation

$$x_1 \frac{\partial u}{\partial x_1} + x_2 \frac{\partial u}{\partial x_2} + x_3 \frac{\partial u}{\partial x_3} = nu.$$

Let $\xi_1 = x_1/x_3$, $\xi_2 = x_2/x_3$, $\xi_3 = x_3$. Use this change of variables to show
that the equation can be written as

$$\xi_3 \frac{\partial u}{\partial \xi_3} = nu.$$

Deduce that u is of the form

$$u = x_3^n F\left( \frac{x_1}{x_3}, \frac{x_2}{x_3} \right).$$

7.12. Let $u_1$ and $u_2$ be defined as

$$u_1 = x_1 (1 - x_2^2)^{1/2} + x_2 (1 - x_1^2)^{1/2},$$

$$u_2 = (1 - x_1^2)^{1/2} (1 - x_2^2)^{1/2} - x_1 x_2.$$

Show that $u_1$ and $u_2$ are functionally dependent.
7.13. Let $\mathbf{f}: R^3 \to R^3$ be defined as

$$\mathbf{u} = \mathbf{f}(\mathbf{x}), \qquad \mathbf{x} = (x_1, x_2, x_3)', \qquad \mathbf{u} = (u_1, u_2, u_3)',$$

where $u_1 = x_1^3$, $u_2 = x_2^3$, $u_3 = x_3^3$.

(a) Show that the Jacobian matrix of f is not nonsingular in any subset
$D \subset R^3$ that contains points on any of the coordinate planes.

(b) Show that f has a unique inverse everywhere in $R^3$ including any
subset D of the type described in (a).

[Note: This exercise shows that the nonvanishing of the Jacobian
determinant in Theorem 7.6.1 (inverse function theorem) is a sufficient
condition for the existence of an inverse function, but is not necessary.]


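7.14. Consider the equations

$$g_1(x_1, x_2, y_1, y_2) = 0,$$
$$g_2(x_1, x_2, y_1, y_2) = 0,$$

where $g_1$ and $g_2$ are differentiable functions defined on a set $D \subset R^4$.
Suppose that $\partial(g_1, g_2)/\partial(x_1, x_2) \ne 0$ in D. Show that

$$\frac{\partial x_1}{\partial y_1} = -\frac{\partial(g_1, g_2)}{\partial(y_1, x_2)} \Big/ \frac{\partial(g_1, g_2)}{\partial(x_1, x_2)},$$

$$\frac{\partial x_2}{\partial y_1} = -\frac{\partial(g_1, g_2)}{\partial(x_1, y_1)} \Big/ \frac{\partial(g_1, g_2)}{\partial(x_1, x_2)}.$$

7.15. Let $f(x_1, x_2, x_3) = 0$, $g(x_1, x_2, x_3) = 0$, where f and g are differentiable
functions defined on a set $D \subset R^3$. Suppose that

$$\frac{\partial(f, g)}{\partial(x_2, x_3)} \ne 0, \qquad \frac{\partial(f, g)}{\partial(x_3, x_1)} \ne 0, \qquad \frac{\partial(f, g)}{\partial(x_1, x_2)} \ne 0$$

in D. Show that

$$\frac{dx_1}{\partial(f, g)/\partial(x_2, x_3)} = \frac{dx_2}{\partial(f, g)/\partial(x_3, x_1)} = \frac{dx_3}{\partial(f, g)/\partial(x_1, x_2)}.$$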


7.16. Determine the stationary points of the following functions and check
for local minima and maxima:

(a) f=x? +x} +x, +x, + xx. 

(b) f=2ax? — xx, +x? +x; =x, +1, where a is a scalar. Can œ be 
chosen so that the stationary point is (i) a point of local minimum; 
(ii) a point of local maximum; (iii) a saddle point? 

(c) f=x} — 6x,x, + 3x5 — 24x; +4. 

(d) f=xf +x- 2x, -x,)’. 


7.17. Consider the function

$$f = \frac{1 + p + \sum_{i=1}^{m} p_i}{\left( 1 + p^2 + \sum_{i=1}^{m} p_i^2 \right)^{1/2}},$$

which is defined on the region

$$C = \{(p_1, p_2, \ldots, p_m) \mid 0 < p \le p_i \le 1,\ i = 1, 2, \ldots, m\},$$

where p is a known constant. Show that

(a) $\partial f / \partial p_i$, for i = 1, 2, ..., m, vanish at exactly one point in C.

(b) The gradient vector $\nabla f = (\partial f/\partial p_1, \partial f/\partial p_2, \ldots, \partial f/\partial p_m)'$ does not
vanish anywhere on the boundary of C.

(c) f attains its absolute maximum in the interior of C at the point
$(p_1^o, p_2^o, \ldots, p_m^o)$, where

$$p_i^o = \frac{1 + p^2}{1 + p}, \qquad i = 1, 2, \ldots, m.$$

[Note: The function f was considered in an article by Thibaudeau and
Styan (1985) concerning a measure of imbalance for experimental
designs.]


7.18. Show that the function $f = (x_2 - x_1^2)(x_2 - 2x_1^2)$ does not have a local
maximum or minimum at the origin, although it has a local minimum
for t = 0 along every straight line given by the equations $x_1 = at$, $x_2 = bt$,
where a and b are constants.
7.19. Find the optimal values of the function $f = x_1^2 + 12x_1x_2 + 2x_2^2$ subject
to $4x_1^2 + x_2^2 = 25$. Determine the nature of the optima.


7.20. Find the minimum distance from the origin to the curve of intersection
of the surfaces, x,(x, +x,)= —2 and x,x,=1. 


7.21. Apply the method of Lagrange multipliers to show that

$$(x_1^2 x_2^2 x_3^2)^{1/3} \le \frac{1}{3}(x_1^2 + x_2^2 + x_3^2)$$

for all values of $x_1, x_2, x_3$.
[Hint: Find the maximum value of $f = x_1^2 x_2^2 x_3^2$ subject to $x_1^2 + x_2^2 + x_3^2 = c^2$,
where c is a constant.]


7.22. Prove Theorem 7.9.3.


7.23. Evaluate the following integrals:
(a) f fp xox, dx, dx, where 


D = {as x2)lx; > 0, x3 > x7, x3 <2 Sh: 


(b) fol fo V 13 (x, + 3X) dx, ]dx,. 


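7.24. Show that if $f(x_1, x_2)$ is continuous, then

$$\int_0^2 \left[ \int_{x_1^2}^{4x_1 - x_1^2} f(x_1, x_2)\, dx_2 \right] dx_1 = \int_0^4 \left[ \int_{2 - \sqrt{4 - x_2}}^{\sqrt{x_2}} f(x_1, x_2)\, dx_1 \right] dx_2.$$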


7.25. Consider the integral
1| f1—x? 
I= 1f( x1, x») dx, |dx,. 
pI 2) | 


(a) Write an equivalent expression for J by reversing the order of 
integration. 


(b) If g(x) = EZTS, x2) dey, find dg/dx,. 


7.26. Evaluate $\int\!\!\int_D x_1 x_2\, dx_1\, dx_2$, where D is a region enclosed by the four
parabolas $x_2^2 = x_1$, $x_2^2 = 2x_1$, $x_1^2 = x_2$, $x_1^2 = 2x_2$.
[Hint: Use a proper change of variables.]

7.27. Evaluate $\int\!\!\int\!\!\int_D (x_1^2 + x_2^2)\, dx_1\, dx_2\, dx_3$, where D is a sphere of radius 1
centered at the origin.
[Hint: Make a change of variables using spherical polar coordinates of
the form

$$x_1 = r \sin\theta \cos\phi,$$
$$x_2 = r \sin\theta \sin\phi,$$
$$x_3 = r \cos\theta,$$

$0 \le r \le 1$, $0 \le \theta \le \pi$, $0 \le \phi \le 2\pi$.]


7.28. Find the value of the integral 
= [% —, as 
o (1+ x?) 


[ Hint: Consider the integral h? dxa +x), where a > 0.] 


In Statistics

7.29. Suppose that the random vector $\mathbf{X} = (X_1, X_2)'$ has the density function

$$f(x_1, x_2) = \begin{cases} x_1 + x_2, & 0 < x_1 < 1,\ 0 < x_2 < 1, \\ 0 & \text{elsewhere}. \end{cases}$$

(a) Are the random variables $X_1$ and $X_2$ independent?
(b) Find the expected value of $X_1 X_2$.


7.30. Consider the density function f(x,,x,) of X = (X, X,)', where 


=x, <xX,<xX,,0<x,<1, 
elsewhere. 


f(x x2) = C 


Show that X, and X, are uncorrelated random variables [that is, 
E(X, X,) = E(X )E(X,)], but are not independent. 


7.31. The density function of $\mathbf{X} = (X_1, X_2)'$ is given by

$$f(x_1, x_2) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\Gamma(\beta)} x_1^{\alpha - 1} x_2^{\beta - 1} e^{-x_1 - x_2}, & 0 < x_1, x_2 < \infty, \\[1ex] 0 & \text{elsewhere}, \end{cases}$$

where $\alpha > 0$, $\beta > 0$, and $\Gamma(m)$ is the gamma function
$\Gamma(m) = \int_0^{\infty} e^{-x} x^{m-1}\, dx$, m > 0. Suppose that $Y_1$ and $Y_2$ are random variables
defined as

$$Y_1 = \frac{X_1}{X_1 + X_2},$$
$$Y_2 = X_1 + X_2.$$

(a) Find the joint density function of $Y_1$ and $Y_2$.
(b) Find the marginal densities of $Y_1$ and $Y_2$.
(c) Are $Y_1$ and $Y_2$ independent?


7.32. Suppose that $\mathbf{X} = (X_1, X_2)'$ has the density function


10x47; 0<x<x,,0<x,<1, 
0 elsewhere. 


f(x x2) = | 
Find the density function of W = X, X}. 


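7.33. Find the density function of $W = (X_1^2 + X_2^2)^{1/2}$ given that $\mathbf{X} = (X_1, X_2)'$
has the density function

$$f(x_1, x_2) = \begin{cases} 4x_1 x_2 e^{-x_1^2 - x_2^2}, & x_1 > 0,\ x_2 > 0, \\ 0 & \text{elsewhere}. \end{cases}$$


7.34. Let $X_1, X_2, \ldots, X_n$ be independent random variables that have the
exponential density $f(x) = e^{-x}$, x > 0. Let $Y_1, Y_2, \ldots, Y_n$ be n random
variables defined as

$$Y_1 = X_1,$$
$$Y_2 = X_1 + X_2,$$
$$\vdots$$
$$Y_n = X_1 + X_2 + \cdots + X_n.$$

Find the density of $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)'$, and then deduce the marginal
density of $Y_n$.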


7.35. Prove formula (7.84).


7.36. Let $X_1$ and $X_2$ be independent random variables such that
$W_1 = (6/\sigma_1^2) X_1$ and $W_2 = (8/\sigma_2^2) X_2$ have the chi-squared distribution with
six and eight degrees of freedom, respectively, where $\sigma_1^2$ and $\sigma_2^2$ are
unknown parameters. Let $\theta = \frac{1}{7}\sigma_1^2 + \frac{1}{9}\sigma_2^2$. An unbiased estimator of $\theta$
is given by $\hat{\theta} = \frac{1}{7} X_1 + \frac{1}{9} X_2$, since $X_1$ and $X_2$ are unbiased estimators of
$\sigma_1^2$ and $\sigma_2^2$, respectively.

Using Satterthwaite's approximation (see Satterthwaite, 1946), it can
be shown that $\eta\hat{\theta}/\theta$ is approximately distributed as a chi-squared
variate with $\eta$ degrees of freedom, where $\eta$ is given by

$$\eta = \frac{\left( \frac{1}{7}\sigma_1^2 + \frac{1}{9}\sigma_2^2 \right)^2}{\frac{1}{6}\left( \frac{1}{7}\sigma_1^2 \right)^2 + \frac{1}{8}\left( \frac{1}{9}\sigma_2^2 \right)^2},$$

which can be written as

$$\eta = \frac{8(9 + 7\lambda)^2}{108 + 49\lambda^2},$$

where $\lambda = \sigma_2^2 / \sigma_1^2$. It follows that the probability

$$p = P\left( \frac{\eta\hat{\theta}}{\chi^2_{0.025, \eta}} < \theta < \frac{\eta\hat{\theta}}{\chi^2_{0.975, \eta}} \right),$$

where $\chi^2_{\alpha, \eta}$ denotes the upper 100α% point of the chi-squared distribution
with $\eta$ degrees of freedom, is approximately equal to 0.95.
Compute the exact value of p using double integration, given that
$\lambda = 2$. Compare the result with the 0.95 value.

[Notes: (1) The density function of a chi-squared random variable with
n degrees of freedom is given in Example 6.9.6. (2) In general, $\eta$ is
unknown. It can be estimated by $\hat{\eta}$, which results from replacing $\lambda$ with
$\hat{\lambda} = X_2 / X_1$ in the formula for $\eta$. (3) The estimator $\hat{\theta}$ is used in the
Behrens-Fisher test statistic for comparing the means of two populations
with unknown variances $\sigma_1^2$ and $\sigma_2^2$, which are assumed to be
unequal. If $\bar{Y}_1$ and $\bar{Y}_2$ are the means of two independent samples of
sizes $n_1 = 7$ and $n_2 = 9$, respectively, randomly chosen from these
populations, then $\theta$ is the variance of $\bar{Y}_1 - \bar{Y}_2$. In this case, $X_1$ and $X_2$
represent the corresponding sample variances. The Behrens-Fisher
t-statistic is then given by

$$t = \frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\hat{\theta}}}.$$

If the two population means are equal, t has approximately the t-distribution
with $\eta$ degrees of freedom. For more details about the
Behrens-Fisher test, see for example, Brownlee (1965, Section 9.8).]
7.37. Suppose that a parabola of the form $\mu = \beta_0 + \beta_1 x + \beta_2 x^2$ is fitted to a
set of paired data, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. Obtain estimates of $\beta_0$,
$\beta_1$, and $\beta_2$ by minimizing $\sum_{i=1}^{n} [y_i - (\beta_0 + \beta_1 x_i + \beta_2 x_i^2)]^2$ with respect
to $\beta_0$, $\beta_1$, and $\beta_2$.

[Note: The estimates obtained in this manner are the least-squares
estimates of $\beta_0$, $\beta_1$, and $\beta_2$.]


7.38. Suppose that we have k disjoint events $A_1, A_2, \ldots, A_k$ such that the
probability of $A_i$ is $p_i$ (i = 1, 2, ..., k) and $\sum_{i=1}^{k} p_i = 1$. Furthermore,
suppose that among n independent trials there are $X_1, X_2, \ldots, X_k$
outcomes associated with $A_1, A_2, \ldots, A_k$, respectively. The joint probability
that $X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k$ is given by the likelihood function

$$L(\mathbf{x}, \mathbf{p}) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k},$$

where $x_i = 0, 1, 2, \ldots, n$ for i = 1, 2, ..., k such that $\sum_{i=1}^{k} x_i = n$,
$\mathbf{x} = (x_1, x_2, \ldots, x_k)'$, $\mathbf{p} = (p_1, p_2, \ldots, p_k)'$. This defines a joint distribution
for $X_1, X_2, \ldots, X_k$ known as the multinomial distribution.

Find the maximum likelihood estimates of $p_1, p_2, \ldots, p_k$ by maximizing
$L(\mathbf{x}, \mathbf{p})$ subject to $\sum_{i=1}^{k} p_i = 1$.
[Hint: Maximize the natural logarithm of $p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$ subject to
$\sum_{i=1}^{k} p_i = 1$.]


7.39. Let $\psi(y)$ be a positive, even, and continuous function on $(-\infty, \infty)$ such
that $\psi(y)$ is strictly decreasing on $(0, \infty)$, and $\int_{-\infty}^{\infty} \psi(y)\, dy = 1$. Consider
the following bivariate density function:

$$f(x, y) = \begin{cases} 1 + x/\psi(y), & -\psi(y) \le x \le 0, \\ 1 - x/\psi(y), & 0 \le x \le \psi(y), \\ 0 & \text{otherwise}. \end{cases}$$

(a) Show that f(x, y) is continuous for $-\infty < x, y < \infty$.
(b) Let F(x, y) be the corresponding cumulative distribution function, 


F(x,y) = f f f(s,0) dsde. 
Show that if 0 < Ax < (0), then 
-1 S 
F(Ax,0) — F(0,0) > [1 4” ("1 - —_] dsat 
(Ax,0) -F00 > ff S -gml 


> 34x '(Ax), 


where ~! is the inverse function of #(y) for 0 <y < œ. 
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(c) Use part (b) to show that 


i F(Ax,0) — F(0,0) 
lim = 


Ax>0* Ax 


Hence, dF(x, y)/ðx does not exist at (0, 0). 
(d) Deduce from part (c) that the equality 


d°F (x,y) 

f(x,y) = ETI 
does not hold in this example. 

[Note: This example was given by Wen (2001) to demonstrate that 

continuity of f(x, y) is not sufficient for the existence of óF/ðx, and 

hence for the validity of the equality in part (d).] 


CHAPTER 8 


Optimization in Statistics 


Optimization is an essential feature of many problems in statistics, and it
arises in almost every field of the subject. Here are a few examples, some of
which will be discussed in more detail in this chapter.


1. In the theory of estimation, an estimator of an unknown parameter is 
sought that satisfies a certain optimality criterion such as minimum 
variance, maximum likelihood, or minimum average risk (as in the case 
of a Bayes estimator). Some of these criteria were already discussed in 
Section 7.11. For example, in regression analysis, estimates of the 
parameters of a fitted model are obtained by minimizing a certain 
expression that measures the closeness of the fit of the model. One 
common example of such an expression is the sum of the squared 
residuals (these are deviations of the predicted response values, as 
specified by the model, from the corresponding observed response 
values). This particular expression is used in the method of ordinary 
least squares. A more general class of parameter estimators is the class 
of M-estimators. See Huber (1973, 1981). The name “M-estimator”
comes from “generalized maximum likelihood.” They are based on the 
idea of replacing the squared residuals by another symmetric function 
of the residuals that has a unique minimum at zero. For example, 
minimizing the sum of the absolute values of the residuals produces the 
so-called least absolute values (LAV) estimators. 

2. Estimates of the variance components associated with random or mixed 
models are obtained by using several methods. In some of these 
methods, the estimates are given as solutions to certain optimization 
problems as in maximum likelihood (ML) estimation and minimum 
norm quadratic unbiased estimation (MINQUE). In the former method, 
the likelihood function is maximized under the assumption of normally 
distributed data [see Hartley and Rao (1967)]. A completely different 
approach is used in the latter method, which was proposed by Rao 
(1970, 1971). This method does not require the normality assumption.
For a review of methods of estimating variance components, see Khuri
and Sahai (1985).


3. In statistical inference, tests are constructed so that they are optimal in
a certain sense. For example, in the Neyman-Pearson lemma (see, for 
example, Roussas, 1973, Chapter 13), a test is obtained by minimizing 
the probability of Type II error while holding the probability of Type I 
error at a certain level. 


4. In the field of response surface methodology, design settings are chosen
to minimize the prediction variance inside a region of interest, or to
minimize the bias that occurs from fitting the “wrong” model. Other
optimality criteria can also be considered. For example, under the
D-optimality criterion, the determinant of the variance-covariance
matrix of the least-squares estimator of the vector of unknown parameters
(of a fitted model) is minimized with respect to the design settings.


5. Another objective of response surface methodology is the determination
of optimum operating conditions on the input variables that
produce maximum, or minimum, response values inside a region of 
interest. For example, in a particular chemical reaction setting, it may 
be of interest to determine the reaction temperature and the reaction 
time that maximize the percentage yield of a product. Optimum seeking 
methods in response surface methodology will be discussed in detail in 
Section 8.3. 


6. Several response variables may be observed in an experiment for each
setting of a group of input variables. Such an experiment is called a
multiresponse experiment. In this case, optimization involves a number 
of response functions and is therefore referred to as simultaneous (or 
multiresponse) optimization. For example, it may be of interest to 
maximize the yield of a certain chemical compound while reducing the 
production cost. Multiresponse optimization will be discussed in
Section 8.7.


7. In multivariate analysis, a large number of measurements may be
available as a result of some experiment. For convenience in the
analysis and interpretation of such data, it would be desirable to work 
with fewer of the measurements, without loss of much information. 
This problem of data reduction is dealt with by choosing certain linear 
functions of the measurements in an optimal manner. Such linear 
functions are called principal components. 

Optimization of a multivariable function was discussed in Chapter 7. 
However, there are situations in which the optimum cannot be obtained 
explicitly by simply following the methods described in Chapter 7. 
Instead, iterative procedures may be needed. In this chapter, we shall 
first discuss some commonly used iterative optimization methods. A 
number of these methods require the explicit evaluation of the partial 
derivatives of the function to be optimized (objective function). These
are referred to as the gradient methods. Three other optimization
techniques that rely solely on the values of the objective function will
also be discussed. They are called direct search methods.


8.1. THE GRADIENT METHODS 


Let $f(\mathbf{x})$ be a real-valued function of $k$ variables $x_1,x_2,\ldots,x_k$, where
$\mathbf{x}=(x_1,x_2,\ldots,x_k)'$. The gradient methods are based on approximating $f(\mathbf{x})$
with a low-degree polynomial, usually of degree one or two, using Taylor's
expansion. The first- and second-order partial derivatives of $f(\mathbf{x})$ are
therefore assumed to exist at every point $\mathbf{x}$ in the domain of $f$. Without loss of
generality, we shall consider that $f$ is to be minimized.


8.1.1. The Method of Steepest Descent 


This method is based on a first-order approximation of $f(\mathbf{x})$ with a
polynomial of degree one using Taylor's theorem (see Section 7.5). Let $\mathbf{x}_0$ be an
initial point in the domain of $f(\mathbf{x})$. Let $\mathbf{x}_0+t\mathbf{h}_0$ be a neighboring point,
where $t\mathbf{h}_0$ represents a small change in the direction of a unit vector $\mathbf{h}_0$ (that
is, $t>0$). The corresponding change in $f(\mathbf{x})$ is $f(\mathbf{x}_0+t\mathbf{h}_0)-f(\mathbf{x}_0)$. A
first-order approximation of this change is given by

$$f(\mathbf{x}_0+t\mathbf{h}_0)-f(\mathbf{x}_0)\approx t\,\mathbf{h}_0'\nabla f(\mathbf{x}_0), \tag{8.1}$$

as can be seen from applying formula (7.27). If the objective is to minimize
$f(\mathbf{x})$, then $\mathbf{h}_0$ must be chosen so as to obtain the largest value for $-t\,\mathbf{h}_0'\nabla f(\mathbf{x}_0)$.
This is a constrained maximization problem, since $\mathbf{h}_0$ has unit length. For this
purpose we use the method of Lagrange multipliers. Let $F$ be the function

$$F=-t\,\mathbf{h}_0'\nabla f(\mathbf{x}_0)+\lambda(\mathbf{h}_0'\mathbf{h}_0-1).$$

By differentiating $F$ with respect to the elements of $\mathbf{h}_0$ and equating the
derivatives to zero we obtain

$$\mathbf{h}_0=\frac{t}{2\lambda}\nabla f(\mathbf{x}_0). \tag{8.2}$$

Using the constraint $\mathbf{h}_0'\mathbf{h}_0=1$, we find that $\lambda$ must satisfy the equation

$$\lambda^2=\frac{t^2}{4}\|\nabla f(\mathbf{x}_0)\|_2^2, \tag{8.3}$$

where $\|\nabla f(\mathbf{x}_0)\|_2$ is the Euclidean norm of $\nabla f(\mathbf{x}_0)$. In order for $-t\,\mathbf{h}_0'\nabla f(\mathbf{x}_0)$
to have a maximum, $\lambda$ must be negative. From formula (8.3) we then have

$$\lambda=-\frac{t}{2}\|\nabla f(\mathbf{x}_0)\|_2.$$

By substituting this expression in formula (8.2) we get

$$\mathbf{h}_0=-\frac{\nabla f(\mathbf{x}_0)}{\|\nabla f(\mathbf{x}_0)\|_2}. \tag{8.4}$$

Thus for a given $t>0$, we can achieve a maximum reduction in $f(\mathbf{x}_0)$ by
moving from $\mathbf{x}_0$ in the direction specified by $\mathbf{h}_0$ in formula (8.4). The value of
$t$ is now determined by performing a linear search in the direction of $\mathbf{h}_0$. This
is accomplished by increasing the value of $t$ (starting from zero) until no
further reduction in the values of $f$ is obtained. Let such a value of $t$ be
denoted by $t_0$. The corresponding value of $\mathbf{x}$ is given by

$$\mathbf{x}_1=\mathbf{x}_0-t_0\frac{\nabla f(\mathbf{x}_0)}{\|\nabla f(\mathbf{x}_0)\|_2}.$$


Since the direction of $\mathbf{h}_0$ is in general not toward the location $\mathbf{x}^*$ of the
true minimum of $f$, the above process must be performed iteratively. Thus if
at stage $i$ we have an approximation $\mathbf{x}_i$ for $\mathbf{x}^*$, then at stage $i+1$ we have the
approximation

$$\mathbf{x}_{i+1}=\mathbf{x}_i+t_i\mathbf{h}_i,\qquad i=0,1,2,\ldots,$$

where

$$\mathbf{h}_i=-\frac{\nabla f(\mathbf{x}_i)}{\|\nabla f(\mathbf{x}_i)\|_2},\qquad i=0,1,2,\ldots,$$

and $t_i$ is determined by a linear search in the direction of $\mathbf{h}_i$, that is, $t_i$ is the
value of $t$ that minimizes $f(\mathbf{x}_i+t\mathbf{h}_i)$. Note that if it is desired to maximize $f$,
then for each $i$ $(\ge 0)$ we need to move in the direction of $-\mathbf{h}_i$. In this case,
the method is called the method of steepest ascent.
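
The iteration above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear search increases $t$ from zero in fixed increments of a hypothetical size `step`, which limits the attainable accuracy to roughly that increment, and the names `line_search`, `steepest_descent`, `f_example`, and `grad_example` are ours, not the book's.

```python
import math


def line_search(f, x, h, step=1e-3, max_steps=100_000):
    """Increase t from zero in increments of `step` until f stops decreasing."""
    t, best = 0.0, f(x)
    for _ in range(max_steps):
        trial = f([xi + (t + step) * hi for xi, hi in zip(x, h)])
        if trial >= best:          # no further reduction: stop the search
            break
        t, best = t + step, trial
    return t


def steepest_descent(f, grad, x0, tol=1e-8, max_iter=1000):
    """Iterate x_{i+1} = x_i + t_i h_i with h_i = -grad f(x_i) / ||grad f(x_i)||."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm < tol:             # gradient is numerically zero
            break
        h = [-gi / norm for gi in g]
        t = line_search(f, x, h)
        if t == 0.0:               # a step of size `step` no longer reduces f
            break
        x = [xi + t * hi for xi, hi in zip(x, h)]
    return x


# Hypothetical test function with its minimum at (1, -0.5).
def f_example(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2


def grad_example(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)]


x_min = steepest_descent(f_example, grad_example, [0.0, 0.0])
print(x_min)
```

A finer `step` gives a more accurate linear search at the cost of more function evaluations; a bracketing search could be substituted without changing the rest of the sketch.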

Convergence of the method of steepest descent can be very slow, since
frequent changes of direction may be necessary. Another reason for slow
convergence is that the direction of $\mathbf{h}_i$ at the $i$th iteration may be nearly
perpendicular to the direction toward the minimum. Furthermore, the method
becomes inefficient when the first-order approximation of $f$ is no longer
adequate. In this case, a second-order approximation should be attempted.
This will be described in the next section.


8.1.2. The Newton-Raphson Method


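Let $\mathbf{x}_0$ be an initial point in the domain of $f(\mathbf{x})$. By a Taylor's expansion of $f$
in a neighborhood of $\mathbf{x}_0$ (see Theorem 7.5.1), it is possible to approximate
$f(\mathbf{x})$ with the quadratic function $\phi(\mathbf{x})$ given by

$$\phi(\mathbf{x})=f(\mathbf{x}_0)+(\mathbf{x}-\mathbf{x}_0)'\nabla f(\mathbf{x}_0)+\tfrac{1}{2}(\mathbf{x}-\mathbf{x}_0)'\mathbf{H}_f(\mathbf{x}_0)(\mathbf{x}-\mathbf{x}_0), \tag{8.5}$$

where $\mathbf{H}_f(\mathbf{x}_0)$ is the Hessian matrix of $f$ evaluated at $\mathbf{x}_0$.

On the basis of formula (8.5) we can obtain a reasonable approximation to
the minimum of $f(\mathbf{x})$ by using the minimum of $\phi(\mathbf{x})$. If $\phi(\mathbf{x})$ attains a local
minimum at $\mathbf{x}_1$, then we must necessarily have $\nabla\phi(\mathbf{x}_1)=\mathbf{0}$ (see Section 7.7),
that is,

$$\nabla f(\mathbf{x}_0)+\mathbf{H}_f(\mathbf{x}_0)(\mathbf{x}_1-\mathbf{x}_0)=\mathbf{0}. \tag{8.6}$$

If $\mathbf{H}_f(\mathbf{x}_0)$ is nonsingular, then from equation (8.6) we obtain

$$\mathbf{x}_1=\mathbf{x}_0-\mathbf{H}_f^{-1}(\mathbf{x}_0)\nabla f(\mathbf{x}_0).$$

If we now approximate $f(\mathbf{x})$ with another quadratic function, by again
applying Taylor's expansion in a neighborhood of $\mathbf{x}_1$, and then repeat the
same process as before with $\mathbf{x}_1$ used instead of $\mathbf{x}_0$, we obtain the point

$$\mathbf{x}_2=\mathbf{x}_1-\mathbf{H}_f^{-1}(\mathbf{x}_1)\nabla f(\mathbf{x}_1).$$

Further repetitions of this process lead to a sequence of points,
$\mathbf{x}_0,\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_i,\ldots$, such that

$$\mathbf{x}_{i+1}=\mathbf{x}_i-\mathbf{H}_f^{-1}(\mathbf{x}_i)\nabla f(\mathbf{x}_i),\qquad i=0,1,2,\ldots. \tag{8.7}$$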


The Newton-Raphson method requires finding the inverse of the Hessian
matrix $\mathbf{H}_f$ at each iteration. This can be computationally involved, especially
if the number of variables, $k$, is large. Furthermore, the method may fail to
converge if $\mathbf{H}_f(\mathbf{x}_i)$ is not positive definite. This can occur, for example, when
$\mathbf{x}_i$ is far from the location $\mathbf{x}^*$ of the true minimum. If, however, the initial
point $\mathbf{x}_0$ is close to $\mathbf{x}^*$, then convergence occurs at a rapid rate.
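
As an illustration of iteration (8.7), the following Python sketch applies the Newton-Raphson update to a small test problem. It is a sketch under assumptions rather than a definitive implementation: instead of forming the inverse Hessian explicitly, it solves the linear system $\mathbf{H}_f(\mathbf{x}_i)\mathbf{d}=\nabla f(\mathbf{x}_i)$ by Gaussian elimination, which gives the same step. The helper `solve` and the test function $e^{x_1}-x_1+(x_2-1)^2$ are hypothetical choices made for the example.

```python
import math


def solve(A, b):
    """Solve A d = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= m * M[c][j]
    d = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * d[j] for j in range(r + 1, n))
        d[r] = (M[r][n] - s) / M[r][r]
    return d


def newton_raphson(grad, hess, x0, tol=1e-10, max_iter=50):
    """Iterate x_{i+1} = x_i - H^{-1}(x_i) grad f(x_i), as in formula (8.7)."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if max(abs(gi) for gi in g) < tol:
            break
        d = solve(hess(x), g)      # d = H^{-1}(x) grad f(x)
        x = [xi - di for xi, di in zip(x, d)]
    return x


# Hypothetical test function f(x) = exp(x1) - x1 + (x2 - 1)^2, minimized at (0, 1).
def grad_example(x):
    return [math.exp(x[0]) - 1.0, 2.0 * (x[1] - 1.0)]


def hess_example(x):
    return [[math.exp(x[0]), 0.0], [0.0, 2.0]]


x_min = newton_raphson(grad_example, hess_example, [1.0, 0.0])
print(x_min)
```

The sketch contains no safeguard for a Hessian that is not positive definite, so it inherits the convergence caveats described above.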


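8.1.3. The Davidon-Fletcher-Powell Method


This method is basically similar to the one in Section 8.1.1 except that at the
$i$th iteration we have

$$\mathbf{x}_{i+1}=\mathbf{x}_i-\theta_i\mathbf{G}_i\nabla f(\mathbf{x}_i),\qquad i=0,1,2,\ldots,$$

where $\mathbf{G}_i$ is a positive definite matrix that serves as the $i$th approximation to
the inverse of the Hessian matrix $\mathbf{H}_f(\mathbf{x}_i)$, and $\theta_i$ is a scalar determined by a
linear search from $\mathbf{x}_i$ in the direction of $-\mathbf{G}_i\nabla f(\mathbf{x}_i)$, similar to the one for the
steepest descent method. The initial choice $\mathbf{G}_0$ of the matrix $\mathbf{G}$ can be any
positive definite matrix, but is usually taken to be the identity matrix. At the
$(i+1)$st iteration, $\mathbf{G}_i$ is updated by using the formula

$$\mathbf{G}_{i+1}=\mathbf{G}_i+\mathbf{L}_i+\mathbf{M}_i,\qquad i=0,1,2,\ldots,$$

where

$$\mathbf{L}_i=-\frac{\mathbf{G}_i[\nabla f(\mathbf{x}_{i+1})-\nabla f(\mathbf{x}_i)][\nabla f(\mathbf{x}_{i+1})-\nabla f(\mathbf{x}_i)]'\mathbf{G}_i}{[\nabla f(\mathbf{x}_{i+1})-\nabla f(\mathbf{x}_i)]'\mathbf{G}_i[\nabla f(\mathbf{x}_{i+1})-\nabla f(\mathbf{x}_i)]},$$

$$\mathbf{M}_i=-\frac{\theta_i[\mathbf{G}_i\nabla f(\mathbf{x}_i)][\mathbf{G}_i\nabla f(\mathbf{x}_i)]'}{[\mathbf{G}_i\nabla f(\mathbf{x}_i)]'[\nabla f(\mathbf{x}_{i+1})-\nabla f(\mathbf{x}_i)]}.$$

The justification for this method is given in Fletcher and Powell (1963).
See also Bunday (1984, Section 4.3). Note that if $\mathbf{G}_0$ is initially chosen as the
identity, then the first increment is in the steepest descent direction $-\nabla f(\mathbf{x}_0)$.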

This is a powerful optimization method and is considered to be very 
efficient for most functions. 
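
A minimal Python sketch of the update $\mathbf{G}_{i+1}=\mathbf{G}_i+\mathbf{L}_i+\mathbf{M}_i$ follows. It is written in terms of $\mathbf{s}=\mathbf{x}_{i+1}-\mathbf{x}_i=-\theta_i\mathbf{G}_i\nabla f(\mathbf{x}_i)$ and $\mathbf{y}=\nabla f(\mathbf{x}_{i+1})-\nabla f(\mathbf{x}_i)$, so that $\mathbf{M}_i=\mathbf{s}\mathbf{s}'/\mathbf{s}'\mathbf{y}$ and $\mathbf{L}_i=-\mathbf{G}_i\mathbf{y}\mathbf{y}'\mathbf{G}_i/\mathbf{y}'\mathbf{G}_i\mathbf{y}$. Several choices here are assumptions made for the illustration: the linear search is a golden-section search (one convenient way to minimize along a direction when the function is unimodal there), and the names `golden_section`, `dfp`, and the quadratic test function are hypothetical.

```python
import math


def golden_section(phi, t_hi=1.0, tol=1e-10):
    """Approximate the minimizer of phi(t) for t >= 0 (phi assumed unimodal)."""
    for _ in range(60):                       # expand until phi stops decreasing
        if phi(2.0 * t_hi) >= phi(t_hi):
            break
        t_hi *= 2.0
    lo, hi = 0.0, 2.0 * t_hi
    r = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = hi - r * (hi - lo), lo + r * (hi - lo)
    while hi - lo > tol:
        if phi(a) < phi(b):
            hi, b = b, a
            a = hi - r * (hi - lo)
        else:
            lo, a = a, b
            b = lo + r * (hi - lo)
    return 0.5 * (lo + hi)


def dfp(f, grad, x0, tol=1e-6, max_iter=100):
    k = len(x0)
    x = list(x0)
    G = [[float(i == j) for j in range(k)] for i in range(k)]   # G_0 = identity
    g = grad(x)
    for _ in range(max_iter):
        if max(abs(gi) for gi in g) < tol:
            break
        # Search direction -G_i grad f(x_i), then the scalar theta_i.
        d = [-sum(G[i][j] * g[j] for j in range(k)) for i in range(k)]
        theta = golden_section(lambda t: f([xi + t * di for xi, di in zip(x, d)]))
        s = [theta * di for di in d]                  # x_{i+1} - x_i
        x = [xi + si for xi, si in zip(x, s)]
        g_new = grad(x)
        y = [a - b for a, b in zip(g_new, g)]         # change in the gradient
        Gy = [sum(G[i][j] * y[j] for j in range(k)) for i in range(k)]
        sy = sum(si * yi for si, yi in zip(s, y))
        yGy = sum(yi * gi for yi, gi in zip(y, Gy))
        g = g_new
        if sy <= 0.0 or yGy <= 0.0:                   # update undefined: stop
            break
        # G_{i+1} = G_i + M_i + L_i
        G = [[G[i][j] + s[i] * s[j] / sy - Gy[i] * Gy[j] / yGy
              for j in range(k)] for i in range(k)]
    return x


# Hypothetical quadratic test function with its minimum at (12/7, -3/7).
def f_example(x):
    return x[0] ** 2 + 2.0 * x[1] ** 2 + x[0] * x[1] - 3.0 * x[0]


def grad_example(x):
    return [2.0 * x[0] + x[1] - 3.0, x[0] + 4.0 * x[1]]


x_min = dfp(f_example, grad_example, [0.0, 0.0])
print(x_min)
```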


8.2. THE DIRECT SEARCH METHODS 


The direct search methods do not require the evaluation of any partial 
derivatives of the objective function. For this reason they are suited for 
situations in which it is analytically difficult to provide expressions for the 
partial derivatives, such as the minimization of the maximum absolute devia- 
tion. Three such methods will be discussed here, namely, the Nelder-Mead 
simplex method, Price’s controlled random search procedure, and general- 
ized simulated annealing. 


8.2.1. The Nelder-Mead Simplex Method 


Let $f(\mathbf{x})$, where $\mathbf{x}=(x_1,x_2,\ldots,x_k)'$, be the function to be minimized. The
simplex method is based on a comparison of the values of $f$ at the $k+1$
vertices of a general simplex followed by a move away from the vertex with
the highest function value. By definition, a general simplex is a geometric
figure formed by a set of $k+1$ points called vertices in a $k$-dimensional
space. Originally, the simplex method was proposed by Spendley, Hext, and
Himsworth (1962), who considered a regular simplex, that is, a simplex with
mutually equidistant points such as an equilateral triangle in a
two-dimensional space ($k=2$). Nelder and Mead (1965) modified this method by
allowing the simplex to be nonregular. This modified version of the simplex
method will be described here.

The simplex method follows a sequential search procedure. As was mentioned
earlier, it begins by evaluating $f$ at the $k+1$ points that form a
general simplex. Let these points be denoted by $\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_{k+1}$. Let $f_h$ and
$f_l$ denote, respectively, the largest and the smallest of the values
$f(\mathbf{x}_1),f(\mathbf{x}_2),\ldots,f(\mathbf{x}_{k+1})$. Let us also denote the points where $f_h$ and $f_l$ are
attained by $\mathbf{x}_h$ and $\mathbf{x}_l$, respectively.

Obviously, if we are interested in minimizing $f$, then a move away from $\mathbf{x}_h$
will be in order. Let us therefore define $\mathbf{x}_c$ as the centroid of all the points
with the exclusion of $\mathbf{x}_h$. Thus

$$\mathbf{x}_c=\frac{1}{k}\sum_{i\ne h}\mathbf{x}_i.$$


In order to move away from $\mathbf{x}_h$, we reflect $\mathbf{x}_h$ with respect to $\mathbf{x}_c$ to obtain the
point $\mathbf{x}_h^*$. More specifically, the latter point is defined by the relation

$$\mathbf{x}_h^*-\mathbf{x}_c=r(\mathbf{x}_c-\mathbf{x}_h),$$

or equivalently,

$$\mathbf{x}_h^*=(1+r)\mathbf{x}_c-r\mathbf{x}_h,$$

where $r$ is a positive constant called the reflection coefficient and is given by

$$r=\frac{\|\mathbf{x}_h^*-\mathbf{x}_c\|_2}{\|\mathbf{x}_c-\mathbf{x}_h\|_2}.$$

The points $\mathbf{x}_h$, $\mathbf{x}_c$, and $\mathbf{x}_h^*$ are depicted in Figure 8.1.

Figure 8.1. A two-dimensional simplex with the reflection ($\mathbf{x}_h^*$), expansion ($\mathbf{x}_{he}^*$), and contraction
($\mathbf{x}_{hc}^*$) points.

Let us consider the following cases:

a. If $f_l<f(\mathbf{x}_h^*)<f_h$, replace $\mathbf{x}_h$ by $\mathbf{x}_h^*$ and start the process again with the
new simplex (that is, evaluate $f$ at the vertices of the simplex which has
the same points as the original simplex, but with $\mathbf{x}_h^*$ substituted for $\mathbf{x}_h$).

b. If $f(\mathbf{x}_h^*)<f_l$, then the move from $\mathbf{x}_c$ to $\mathbf{x}_h^*$ is in the right direction and
should therefore be expanded. In this case, $\mathbf{x}_h^*$ is expanded to $\mathbf{x}_{he}^*$
defined by the relation

$$\mathbf{x}_{he}^*-\mathbf{x}_c=\gamma(\mathbf{x}_h^*-\mathbf{x}_c),$$

that is,

$$\mathbf{x}_{he}^*=\gamma\mathbf{x}_h^*+(1-\gamma)\mathbf{x}_c,$$

where $\gamma$ $(>1)$ is an expansion coefficient given by

$$\gamma=\frac{\|\mathbf{x}_{he}^*-\mathbf{x}_c\|_2}{\|\mathbf{x}_h^*-\mathbf{x}_c\|_2}$$

(see Figure 8.1). This operation is called expansion. If $f(\mathbf{x}_{he}^*)<f_l$,
replace $\mathbf{x}_h$ by $\mathbf{x}_{he}^*$, and restart the process. However, if $f(\mathbf{x}_{he}^*)>f_l$, then
expansion is counterproductive. In this case, $\mathbf{x}_{he}^*$ is dropped, $\mathbf{x}_h$ is
replaced by $\mathbf{x}_h^*$, and the process is restarted.


c. If upon reflecting $\mathbf{x}_h$ to $\mathbf{x}_h^*$ we discover that $f(\mathbf{x}_h^*)>f(\mathbf{x}_i)$ for all $i\ne h$,
then replacing $\mathbf{x}_h$ by $\mathbf{x}_h^*$ would leave $f(\mathbf{x}_h^*)$ as the maximum in the new
simplex. In this case, a new $\mathbf{x}_h$ is defined to be either the old $\mathbf{x}_h$ or $\mathbf{x}_h^*$,
whichever has the lower value. A point $\mathbf{x}_{hc}^*$ is then found such that

$$\mathbf{x}_{hc}^*-\mathbf{x}_c=\beta(\mathbf{x}_h-\mathbf{x}_c),$$

that is,

$$\mathbf{x}_{hc}^*=\beta\mathbf{x}_h+(1-\beta)\mathbf{x}_c,$$

where $\beta$ $(0<\beta<1)$ is a contraction coefficient given by

$$\beta=\frac{\|\mathbf{x}_{hc}^*-\mathbf{x}_c\|_2}{\|\mathbf{x}_h-\mathbf{x}_c\|_2}.$$

Next, $\mathbf{x}_{hc}^*$ is substituted for $\mathbf{x}_h$ and the process is restarted unless
$f(\mathbf{x}_{hc}^*)>\min[f_h,f(\mathbf{x}_h^*)]$, that is, the contracted point is worse than the
better of $f_h$ and $f(\mathbf{x}_h^*)$. When such a contraction fails, the size of the
simplex is reduced by halving the distance of each point of the simplex
from $\mathbf{x}_l$, where, if we recall, $\mathbf{x}_l$ is the point generating the lowest
function value. Thus $\mathbf{x}_i$ is replaced by $\mathbf{x}_l+\tfrac{1}{2}(\mathbf{x}_i-\mathbf{x}_l)$, that is, by
$\tfrac{1}{2}(\mathbf{x}_i+\mathbf{x}_l)$. The process is then restarted with the new reduced simplex.

Step 1. Select initial points, $\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_{k+1}$, and calculate $f(\mathbf{x}_i)$, $i=1,2,\ldots,k+1$.
Determine $\mathbf{x}_h$, $\mathbf{x}_l$ and calculate $\mathbf{x}_c=\sum_{i\ne h}\mathbf{x}_i/k$. Select $r>0$ (for example, $r=1$), find
$\mathbf{x}_h^*=(1+r)\mathbf{x}_c-r\mathbf{x}_h$ and calculate $f(\mathbf{x}_h^*)$.

Step 2. (a) Calculate $\mathbf{x}_{he}^*=\gamma\mathbf{x}_h^*+(1-\gamma)\mathbf{x}_c$ by choosing $\gamma>1$, say $\gamma=1.5$, then calculate
$f(\mathbf{x}_{he}^*)$.

(b) Replace $\mathbf{x}_h$ with $\mathbf{x}_h^*$.

Step 3. Replace $\mathbf{x}_h$ with $\mathbf{x}_{he}^*$.

Step 4. Calculate $\mathbf{x}_{hc}^*=\beta\mathbf{x}_h+(1-\beta)\mathbf{x}_c$ by choosing $0<\beta<1$, say $\beta=0.5$, then calculate
$f(\mathbf{x}_{hc}^*)$.

Step 5. Replace all the $\mathbf{x}_i$'s with $(\mathbf{x}_i+\mathbf{x}_l)/2$.

Step 6. Replace $\mathbf{x}_h$ with $\mathbf{x}_{hc}^*$.


Figure 8.2. Flow diagram for the Nelder-Mead simplex method. Source: Nelder and Mead
(1965). Reproduced with permission of Oxford University Press.


Thus at each stage in the minimization process, $\mathbf{x}_h$, the point at
which $f$ has the highest value, is replaced by a new point according to
one of three operations, namely, reflection, contraction, and expansion.
As an aid to illustrating this step-by-step procedure, a flow diagram is
shown in Figure 8.2. This flow diagram is similar to one given by Nelder
and Mead (1965, page 309). Figure 8.2 lists the explanations of steps 1
through 6.

The criterion used to stop the search procedure is based on the
variation in the function values over the simplex. At each step, the
standard error of these values in the form

$$s=\left[\frac{\sum_{i=1}^{k+1}(f_i-\bar{f})^2}{k}\right]^{1/2}$$

is calculated and compared with some preselected value $d$, where
$f_1,f_2,\ldots,f_{k+1}$ denote the function values at the vertices of the simplex
at hand and $\bar{f}=\sum_{i=1}^{k+1}f_i/(k+1)$. The search is halted when $s<d$. The
reasoning behind this criterion is that when $s<d$, all function values
are very close together. This suggests, but does not guarantee, that the
points of the simplex are near the minimum.


Bunday (1984) provided the listing of a computer program which can be 
used to implement the steps described in the flow diagram. 

Olsson and Nelson (1975) demonstrated the usefulness of this method by 
using it to solve six minimization problems in statistics. The robustness of the 
method itself and its advantages relative to other minimization techniques 
were reported in Nelson (1973). 
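
The reflection, expansion, contraction, and shrinking operations, together with the stopping rule $s<d$, can be sketched in Python as follows. This is a minimal sketch and not the program listed by Bunday (1984): the function name `nelder_mead`, the default coefficients ($r=1$, $\gamma=1.5$, $\beta=0.5$), the tolerance `d`, and the test function are assumptions chosen for the illustration.

```python
import math


def nelder_mead(f, simplex, r=1.0, gamma=1.5, beta=0.5, d=1e-10, max_iter=2000):
    pts = [list(p) for p in simplex]          # the k + 1 vertices
    k = len(pts) - 1
    for _ in range(max_iter):
        vals = [f(p) for p in pts]
        mean = sum(vals) / (k + 1)
        s = math.sqrt(sum((v - mean) ** 2 for v in vals) / k)
        if s < d:                                    # stopping rule s < d
            break
        h = max(range(k + 1), key=vals.__getitem__)  # vertex with value f_h
        l = min(range(k + 1), key=vals.__getitem__)  # vertex with value f_l
        others = [i for i in range(k + 1) if i != h]
        xc = [sum(pts[i][j] for i in others) / k for j in range(k)]  # centroid
        xr = [(1 + r) * c - r * a for c, a in zip(xc, pts[h])]       # reflection
        fr = f(xr)
        if fr < vals[l]:                             # case b: try an expansion
            xe = [gamma * a + (1 - gamma) * c for a, c in zip(xr, xc)]
            pts[h] = xe if f(xe) < vals[l] else xr
        elif all(fr > vals[i] for i in others):      # case c: contraction
            if fr < vals[h]:                         # new x_h: better of x_h, x_h*
                pts[h], vals[h] = xr, fr
            xk = [beta * a + (1 - beta) * c for a, c in zip(pts[h], xc)]
            if f(xk) > vals[h]:                      # failed contraction: shrink
                pts = [[0.5 * (a + b) for a, b in zip(p, pts[l])] for p in pts]
            else:
                pts[h] = xk
        else:                                        # case a: keep the reflection
            pts[h] = xr
    best = min(pts, key=f)
    return best, f(best)


# Hypothetical test function with its minimum at (1, 2).
def f_example(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2


x_best, f_best = nelder_mead(f_example, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(x_best, f_best)
```

For clarity the sketch re-evaluates $f$ at every vertex on each pass; an implementation concerned with cost would cache these values.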


8.2.2. Price’s Controlled Random Search Procedure 


The controlled random search procedure was introduced by Price (1977). It is 
capable of finding the absolute (or global) minimum of a function within a 
constrained region R. It is therefore well suited for a multimodal function, 
that is, a function that has several local minima within the region R. 

The essential features of Price's algorithm are outlined in the flow diagram
of Figure 8.3. A predetermined number, $N$, of trial points are randomly
chosen inside the region $R$. The value of $N$ must be greater than $k$, the
number of variables. The corresponding function values are obtained and
stored in an array $\mathbf{A}$ along with the coordinates of the $N$ chosen points. At
each iteration, $k+1$ distinct points, $\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_{k+1}$, are chosen at random
from the $N$ points in storage. These $k+1$ points form a simplex in a
$k$-dimensional space. The point $\mathbf{x}_{k+1}$ is arbitrarily taken as the pole
(designated vertex) of the simplex, and the next trial point $\mathbf{x}_p$ is obtained as the
image (reflection) point of the pole with respect to the centroid $\mathbf{x}_c$ of the
remaining $k$ points. Thus

$$\mathbf{x}_p=2\mathbf{x}_c-\mathbf{x}_{k+1}.$$


The point $\mathbf{x}_p$ must satisfy the constraints of the region $R$. The value of the
function $f$ at $\mathbf{x}_p$ is then compared with $f_{\max}$, the largest function value in
storage. Let $\mathbf{x}_{\max}$ denote the point at which $f_{\max}$ is achieved. If $f(\mathbf{x}_p)<f_{\max}$,
then $\mathbf{x}_{\max}$ is replaced in the array $\mathbf{A}$ by $\mathbf{x}_p$. If $\mathbf{x}_p$ fails to satisfy the constraints
of the region $R$, or if $f(\mathbf{x}_p)>f_{\max}$, then $\mathbf{x}_p$ is discarded and a new point is
chosen by following the same procedure as the one used to obtain $\mathbf{x}_p$. As the
algorithm proceeds, the $N$ points in storage tend to cluster around points at
which the function values are lower than the current value of $f_{\max}$. Price did
not specify a particular stopping rule. He left it to the user to do so. A
possible stopping criterion is to terminate the search when the $N$ points in
storage cluster in a small region of the $k$-dimensional space, that is, when
$f_{\max}$ and $f_{\min}$ are close together, where $f_{\min}$ is the smallest function value in
storage. Another possibility is to stop after a specified number of function
evaluations have been made. In any case, the rate of convergence of the
procedure depends on the value of $N$, the complexity of the function $f$, the
nature of the constraints, and the way in which the set of trial points is
chosen.

Input: The number of variables, $k$; the number of points, $N$, to be stored; the constraints of
the region $R$.

Choose $N$ points at random from $R$, and store the coordinates and corresponding function
values in an $N\times(k+1)$ matrix $\mathbf{A}$.

Determine the point, $\mathbf{x}_{\max}$, that has the largest function value, $f_{\max}$.

Randomly choose $k+1$ distinct points $\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_{k+1}$ from the set of $N$ points in storage.
Determine the centroid $\mathbf{x}_c$ of $\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_k$; then determine the next trial point
$\mathbf{x}_p=2\mathbf{x}_c-\mathbf{x}_{k+1}$.

Does $\mathbf{x}_p$ fall in the region $R$? If yes, evaluate $f(\mathbf{x}_p)$.

Is $f(\mathbf{x}_p)<f_{\max}$? If yes, replace in the matrix $\mathbf{A}$ the coordinates and function value at
$\mathbf{x}_{\max}$ by those of $\mathbf{x}_p$.

Is the stopping criterion satisfied? If yes, print $\mathbf{A}$ and stop; if no, continue the search.


Figure 8.3. A flow diagram for Price's procedure. Source: Price (1977). Reproduced with
permission of Oxford University Press.

Price’s procedure is simple and does not necessarily require a large value 
of N. It is sufficient that N should increase linearly with k. Price chose, for 
example, the value N = 50 for k = 2. The value N= 10k has proved useful 
for many functions. Furthermore, the region constraints can be quite com- 
plex. A FORTRAN program for the implementation of Price’s algorithm was 
written by Conlon (1991). 
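
The procedure can be sketched in Python as shown below. This is a sketch under assumptions, not a translation of Conlon's FORTRAN program: the region $R$ is taken to be a box given by hypothetical `bounds`, the stopping rule is the closeness of $f_{\max}$ and $f_{\min}$ (with a cap on the number of trials), and the function name `price_crs`, the seed, and the two-minimum test function are chosen only for the illustration.

```python
import random


def price_crs(f, bounds, n_points, max_trials=50_000, tol=1e-8, seed=1):
    rng = random.Random(seed)
    k = len(bounds)

    def in_region(x):
        # Assumption: the region R is the box described by `bounds`.
        return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, bounds))

    # Array A: the N stored points together with their function values.
    A = []
    for _ in range(n_points):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        A.append((f(x), x))

    for _ in range(max_trials):
        i_max = max(range(n_points), key=lambda i: A[i][0])
        f_max = A[i_max][0]
        f_min = min(a[0] for a in A)
        if f_max - f_min < tol:               # stored points have clustered
            break
        chosen = rng.sample(range(n_points), k + 1)
        pole = A[chosen[-1]][1]               # x_{k+1}, the pole
        xc = [sum(A[i][1][j] for i in chosen[:-1]) / k for j in range(k)]
        xp = [2.0 * c - p for c, p in zip(xc, pole)]   # x_p = 2 x_c - x_{k+1}
        if not in_region(xp):                 # discard and choose again
            continue
        fp = f(xp)
        if fp < f_max:                        # replace the worst stored point
            A[i_max] = (fp, xp)
    return min(A, key=lambda a: a[0])


# Hypothetical function with two local minima, near x1 = -1 and x1 = 1.
def f_example(x):
    return (x[0] ** 2 - 1.0) ** 2 + 0.2 * x[0] + x[1] ** 2


f_best, x_best = price_crs(f_example, [(-3.0, 3.0), (-3.0, 3.0)], n_points=50)
print(f_best, x_best)
```

The value $N=50$ for $k=2$ follows the choice attributed to Price above; a more complicated region only requires a different `in_region` test and a way of sampling the initial points from $R$.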


8.2.3. The Generalized Simulated Annealing Method 


This method derives its name from the annealing of metals, in which many
final crystalline configurations (corresponding to different energy states) are
possible, depending on the rate of cooling (see Kirkpatrick, Gelatt, and
Vecchi, 1983). The method can be applied to find the absolute (or global)
optimum of a multimodal function $f$ within a constrained region $R$ in a
$k$-dimensional space.

Bohachevsky, Johnson, and Stein (1986) presented a generalization of the
method of simulated annealing for function optimization. The following is a
description of their algorithm for function minimization (a similar one can be
used for function maximization): Let $f_m$ be some tentative estimate of the
minimum of $f$ over the region $R$. The method proceeds according to the
following steps (reproduced with permission of the American Statistical
Association):


1. Select an initial point $\mathbf{x}_0$ in $R$. This point can be chosen at random or
specified depending on available information.

2. Calculate $f_0=f(\mathbf{x}_0)$. If $|f_0-f_m|<\epsilon$, where $\epsilon$ is a specified small
constant, then stop.

3. Choose a random direction of search by generating $k$ independent
standard normal variates $z_1,z_2,\ldots,z_k$; then compute the elements of
the random vector $\mathbf{u}=(u_1,u_2,\ldots,u_k)'$, where

$$u_i=\frac{z_i}{(z_1^2+z_2^2+\cdots+z_k^2)^{1/2}},\qquad i=1,2,\ldots,k,$$

and $k$ is the number of variables in the function.

4. Set $\mathbf{x}_1=\mathbf{x}_0+\Delta r\,\mathbf{u}$, where $\Delta r$ is the size of a step to be taken in the
direction of $\mathbf{u}$. The magnitude of $\Delta r$ depends on the properties of the
objective function and on the desired accuracy.

5. If $\mathbf{x}_1$ does not belong to $R$, return to step 3. Otherwise, compute
$f_1=f(\mathbf{x}_1)$ and $\Delta f=f_1-f_0$.

6. If $f_1\le f_0$, set $\mathbf{x}_0=\mathbf{x}_1$ and $f_0=f_1$. If $|f_0-f_m|<\epsilon$, stop. Otherwise, go to
step 3.

7. If $f_1>f_0$, set a probability value given by $p=\exp(-\beta f_0^g\,\Delta f)$, where $\beta$
is a positive number such that $0.50<\exp(-\beta\,\Delta f)<0.90$, and $g$ is an
arbitrary negative number. Then, generate a random number $v$ from
the uniform distribution $U(0,1)$. If $v\ge p$, go to step 3. Otherwise, if
$v<p$, set $\mathbf{x}_0=\mathbf{x}_1$, $f_0=f_1$ and go to step 3.


From steps 6 and 7 we note that beneficial steps (that is, $f_1\le f_0$) are
accepted unconditionally, but detrimental steps ($f_1>f_0$) are accepted
according to a probability value $p$ described in step 7. If $v<p$, then the step leading
to $\mathbf{x}_1$ is accepted; otherwise, it is rejected and a step in a new random
direction is attempted. Thus the probability of accepting an increment of $f$
depends on the size of the increment: the larger the increment, the smaller
the probability of its acceptance.

Several possible values of the tentative estimate $f_m$ can be attempted. For
a given $f_m$ we proceed with the search until $f-f_m$ becomes negative. Then,
we decrease $f_m$, continue the search, and repeat the process when necessary.
Bohachevsky, Johnson, and Stein gave an example in optimal design theory
to illustrate the application of their algorithm.
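
Steps 1 through 7 can be sketched in Python as follows. The sketch is illustrative only and rests on several assumptions: the function values are taken to be positive so that $f_0^g$ is defined for negative $g$, the parameter values (`dr`, `beta`, `g`, `eps`) are hypothetical and would need tuning for a real problem, the region $R$ is supplied as a user-written test `in_region`, and the record of the best point visited is a convenience that is not part of the steps listed above. The test function is likewise a hypothetical multimodal example whose minimum value is 0 at the origin.

```python
import math
import random


def gsa_minimize(f, x_start, in_region, f_m, dr, beta, g=-1.0,
                 eps=1e-3, max_steps=20_000, seed=1):
    rng = random.Random(seed)
    k = len(x_start)
    x0, f0 = list(x_start), f(x_start)              # steps 1 and 2
    best_x, best_f = x0, f0                         # bookkeeping only
    for _ in range(max_steps):
        if abs(f0 - f_m) < eps:                     # stopping rule (steps 2, 6)
            break
        z = [rng.gauss(0.0, 1.0) for _ in range(k)]     # step 3
        norm = math.sqrt(sum(zi * zi for zi in z))
        if norm == 0.0:
            continue
        u = [zi / norm for zi in z]                 # random unit direction
        x1 = [a + dr * ui for a, ui in zip(x0, u)]  # step 4
        if not in_region(x1):                       # step 5
            continue
        f1 = f(x1)
        df = f1 - f0
        if df <= 0.0:                               # step 6: accept
            x0, f0 = x1, f1
        else:                                       # step 7: accept with prob. p
            p = math.exp(-beta * (f0 ** g) * df)
            if rng.random() < p:
                x0, f0 = x1, f1
        if f0 < best_f:
            best_x, best_f = x0, f0
    return best_x, best_f


# Hypothetical multimodal test function; it is nonnegative and equals 0 at (0, 0).
def f_example(x):
    return (x[0] ** 2 + 2.0 * x[1] ** 2
            - 0.3 * math.cos(3.0 * math.pi * x[0])
            - 0.4 * math.cos(4.0 * math.pi * x[1]) + 0.7)


def in_box(x):
    return all(abs(xi) <= 2.0 for xi in x)


x_best, f_best = gsa_minimize(f_example, [1.0, 1.0], in_box,
                              f_m=0.0, dr=0.15, beta=3.5)
print(x_best, f_best)
```

Because the step size $\Delta r$ is fixed, the sketch cannot locate the minimum more accurately than roughly that step; reducing `dr` improves the resolution at the cost of a slower search.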

Price’s (1977) controlled random search algorithm produces results compa- 
rable to those of simulated annealing, but with fewer tuning parameters. It is 
also better suited for problems with constrained regions. 


8.3. OPTIMIZATION TECHNIQUES IN RESPONSE 
SURFACE METHODOLOGY 


Response surface methodology (RSM) is an area in the design and analysis of 
experiments. It consists of a collection of techniques that encompasses: 


1. Conducting a series of experiments based on properly chosen settings
of a set of input variables, denoted by $x_1,x_2,\ldots,x_k$, that influence a
response of interest $y$. The choice of these settings is governed by
certain criteria whose purpose is to produce adequate and reliable
information about the response. The collection of all such settings
constitutes a matrix $\mathbf{D}$ of order $n\times k$, where $n$ is the number of
experimental runs. The matrix $\mathbf{D}$ is referred to as a response surface
design.

2. Determining a mathematical model that best fits the data collected
under the design chosen in (1). Regression techniques can be used to
evaluate the adequacy of fit of the model and to conduct appropriate
tests concerning the model's parameters.

3. Determining optimal operating conditions on the input variables that
produce maximum (or minimum) response value within a region of
interest $R$.


This last aspect of RSM can help the experimenter in determining the best 
combinations of the input variables that lead to desirable response values. 
For example, in drug manufacturing, two drugs are tested with regard to 
reducing blood pressure in humans. A series of clinical trials involving a 
certain number of high blood pressure patients is set up, and each patient 
is given some predetermined combination of the two drugs. After a period of 
time the patient’s blood pressure is checked. This information can be used to 
find the specific combination of the drugs that results in the greatest 
reduction in the patient’s blood pressure within some specified time interval. 

In this section we shall describe two well-known optimum-seeking proce- 
dures in RSM. These include the method of steepest ascent (or descent) and 
ridge analysis. 


8.3.1. The Method of Steepest Ascent 


This is an adaptation of the method described in Section 8.1.1 to a response 
surface environment; here the objective is to increase the value of a certain 
response function. 

The method of steepest ascent requires performing a sequence of sets of 
trials. Each set is obtained as a result of proceeding sequentially along a path 
of maximum increase in the values of a given response y, which can be 
observed in an experiment. This method was first introduced by Box and 
Wilson (1951) for the general area of RSM. 

The procedure of steepest ascent depends on approximating a response
surface with a hyperplane in some restricted region. The hyperplane is
represented by a first-order model which can be fitted to a data set obtained
as a result of running experimental trials using a first-order design such as a
complete $2^k$ factorial design, where $k$ is the number of input variables in the
model. A fraction of this design can also be used if $k$ is large [see, for
example, Section 3.3.2 in Khuri and Cornell (1996)]. The fitted first-order
model is then used to determine a path along which one may initially observe
increasing response values. However, due to curvature in the response
surface, the initial increase in the response will likely be followed by a
leveling off, and then a decrease. At this stage, a new series of experiments is
performed (using again a first-order design) and the resulting data are used
to fit another first-order model. A new path is determined along which
increasing response values may be observed. This process continues until it
becomes evident that little or no additional increase in the response can be
gained.

Let us now consider more specific details of this sequential procedure. Let
$y(\mathbf{x})$ be a response function that depends on $k$ input variables, $x_1,x_2,\ldots,x_k$,
which form the elements of a vector $\mathbf{x}$. Suppose that in some restricted region
$y(\mathbf{x})$ is adequately represented by a first-order model of the form

$$y(\mathbf{x})=\beta_0+\sum_{i=1}^{k}\beta_ix_i+\epsilon, \tag{8.8}$$

where $\beta_0,\beta_1,\ldots,\beta_k$ are unknown parameters and $\epsilon$ is a random error. This
model is fitted using data collected under a first-order design (for example, a
$2^k$ factorial design or a fraction thereof). The data are utilized to calculate
the least-squares estimates $\hat{\beta}_0,\hat{\beta}_1,\ldots,\hat{\beta}_k$ of the model's parameters. These
are elements of $\hat{\boldsymbol{\beta}}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, where $\mathbf{X}=[\mathbf{1}_n:\mathbf{D}]$ with $\mathbf{1}_n$ being a vector of
ones of order $n\times 1$, $\mathbf{D}$ is the design matrix of order $n\times k$, and $\mathbf{y}$ is the
corresponding vector of response values. It is assumed that the random
errors associated with the $n$ response values are independently distributed
with means equal to zero and a common variance $\sigma^2$. The predicted
response $\hat{y}(\mathbf{x})$ is then given by

$$\hat{y}(\mathbf{x})=\hat{\beta}_0+\sum_{i=1}^{k}\hat{\beta}_ix_i. \tag{8.9}$$

The input variables are coded so that the design center coincides with the
origin of the coordinates system.

The next step is to move a distance of r units away from the design center 
(or the origin) such that a maximum increase in $\hat{y}$ can be obtained. To 
determine the direction to be followed to achieve such an increase, we need 
to maximize $\hat{y}(\mathbf{x})$ subject to the constraint $\sum_{i=1}^{k} x_i^2 = r^2$ using the method of 
Lagrange multipliers. Consider therefore the function 

$$Q(\mathbf{x}) = \hat\beta_0 + \sum_{i=1}^{k} \hat\beta_i x_i - \lambda\left(\sum_{i=1}^{k} x_i^2 - r^2\right), \tag{8.10}$$

where $\lambda$ is a Lagrange multiplier. Setting the partial derivatives of Q equal to 
zero produces the equations 

$$x_i = \frac{1}{2\lambda}\hat\beta_i, \qquad i = 1, 2, \ldots, k.$$

For a maximum, $\lambda$ must be positive. Using the equality constraint, we 
conclude that 

$$\lambda = \frac{1}{2r}\left(\sum_{i=1}^{k} \hat\beta_i^2\right)^{1/2}.$$

A local maximum is then achieved at the point whose coordinates are given 
by 

$$x_i = \frac{r\hat\beta_i}{\left(\sum_{j=1}^{k} \hat\beta_j^2\right)^{1/2}}, \qquad i = 1, 2, \ldots, k,$$

which can be written as 

$$x_i = r e_i, \qquad i = 1, 2, \ldots, k, \tag{8.11}$$

where $e_i = \hat\beta_i \left(\sum_{j=1}^{k} \hat\beta_j^2\right)^{-1/2}$, $i = 1, 2, \ldots, k$. Thus $\mathbf{e} = (e_1, e_2, \ldots, e_k)'$ is a unit 
vector in the direction of $(\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_k)'$. Equations (8.11) indicate that at a 
distance of r units away from the origin, a maximum increase in $\hat{y}$ occurs 
along a path in the direction of $\mathbf{e}$. Since this is the only local maximum on the 
hypersphere of radius r, it must be the absolute maximum. 
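
The computation leading to equations (8.11) can be sketched numerically. The following is a minimal illustration only: the $2^2$ factorial design and the response values are made up for the purpose, and the function name `steepest_ascent_point` is hypothetical.

```python
import numpy as np

def steepest_ascent_point(D, y, r):
    """Fit the first-order model (8.8) by least squares and return
    the estimates together with the point x = r * e of equation (8.11)."""
    n = D.shape[0]
    X = np.column_stack([np.ones(n), D])            # X = [1_n : D]
    b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # equals (X'X)^{-1} X'y for full-rank X
    slopes = b_hat[1:]
    e = slopes / np.linalg.norm(slopes)             # unit vector along the estimated slopes
    return b_hat, r * e

# Hypothetical 2^2 factorial design in coded units, with made-up responses.
D = np.array([[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])
y = np.array([10.0, 14.0, 11.0, 17.0])

b_hat, x_star = steepest_ascent_point(D, y, r=1.0)

def y_hat(x):
    """Predicted response (8.9) at the point x."""
    return b_hat[0] + b_hat[1:] @ x
```

Here the fitted slopes are proportional to (2.5, 1), so the path of steepest ascent leaves the origin in that direction; no other point on the unit circle gives a larger predicted response.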

If the actual response value (that is, the value of y) at the point x =re 
exceeds its value at the origin, then a move along the path determined by e is 
in order. A series of experiments is then conducted to obtain response values 
at several points along the path until no additional increase in the response is 
evident. At this stage, a new first-order model is fitted using data collected 
under a first-order design centered at a point in the vicinity of the point at 
which the first drop in the response was observed along the path. This model 
leads to a new direction similar to the one given by formula (8.11). As before, 
a series of experiments is conducted along the new path until no further 
increase in the value of y can be observed. The process of moving along 
different paths continues until it becomes evident that little or no additional 
increase in y can be gained. This usually occurs when the first-order model 
becomes inadequate as the method progresses, due to curvature in the 
response surface. It is therefore necessary to test each fitted first-order 
model for lack of fit at every stage of the process. This can be accomplished 
by taking repeated observations at the center of each first-order design and 
at possibly some other design points in order to obtain an independent 
estimate of the error variance that is needed for the lack of fit test [see, for 
example, Sections 2.6 and 3.4 in Khuri and Cornell (1996)]. If the lack of fit 
test is significant, indicating an inadequate model, then the process is 
stopped and a more elaborate experiment must be conducted to fit a 
higher-order model, as will be seen in the next section. 

Examples that illustrate the application of the method of steepest ascent 
can be found in Box and Wilson (1951), Bayne and Rubin (1986, Section 5.2), 
Khuri and Cornell (1996, Chapter 5), and Myers and Khuri (1979). In the last 
reference, the authors present a stopping rule along a path that takes into 
account random error variation in the observed response. We recall that a 
search along a path is discontinued as soon as a drop in the response is first 
observed. Since response values are subject to random error, the decision to 
stop can be premature due to a false drop in the observed response. The 
stopping rule by Myers and Khuri (1979) protects against taking too many 
observations along a path when in fact the true mean response (that is, the 
mean of y) is decreasing. It also protects against stopping prematurely when 
the true mean response is increasing. 

It should be noted that the procedure of steepest ascent is not invariant 
with respect to the scales of the input variables $x_1, x_2, \ldots, x_k$. This is evident 
from the fact that a path taken by the procedure is determined by the 
least-squares estimates $\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_k$ [see equations (8.11)], which depend on 
the scales of the $x_i$'s. 
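
The lack of scale invariance can be seen in a small numerical sketch. The design, the responses, and the rescaling factor `c` below are all hypothetical; the point is only that the path obtained after changing the units of $x_1$ does not coincide with the original path when both are expressed in the same units.

```python
import numpy as np

def ascent_direction(D, y):
    """Unit vector along the least-squares slope estimates of a first-order model."""
    X = np.column_stack([np.ones(len(y)), D])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1:] / np.linalg.norm(b[1:])

# Hypothetical 2^2 factorial design with made-up responses.
D = np.array([[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])
y = np.array([10.0, 14.0, 11.0, 17.0])

e_original = ascent_direction(D, y)

c = 10.0                                   # hypothetical change of units for x_1
D_rescaled = D * np.array([c, 1.0])
e_rescaled = ascent_direction(D_rescaled, y)

# Express the direction found in the rescaled units back in the original units of x_1.
back = e_rescaled / np.array([c, 1.0])
back /= np.linalg.norm(back)
```

With these numbers the original direction is proportional to (2.5, 1), whereas the direction found after rescaling, mapped back to the original units, is proportional to (0.025, 1).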

There are situations in which it is of interest to determine conditions that 
lead to a decrease in the response, instead of an increase. For example, in a 
chemical investigation it may be desired to decrease the level of impurity or 
the unit cost. In this case, a path of steepest descent will be needed. This can 
be accomplished by changing the sign of the response y, followed by an 
application of the method of steepest ascent. Thus any steepest descent 
problem can be handled by the method of steepest ascent. 


8.3.2. The Method of Ridge Analysis 


The method of steepest ascent is most often used as a maximum-region-seek- 
ing procedure. By this we mean that it is used as a preliminary tool to get 
quickly to the region where the maximum of the mean response is located. 
Since the first-order approximation of the mean response will eventually 
break down, a better estimate of the maximum can be obtained by fitting a 
second-order model in the region of the maximum. The method of ridge 
analysis, which was introduced by Hoerl (1959) and formalized by Draper 
(1963), is used for this purpose. 

Let us suppose that inside a region of interest R, the true mean response 
is adequately represented by the second-order model 


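$$y(\mathbf{x}) = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k-1} \sum_{\substack{j=2 \\ i<j}}^{k} \beta_{ij} x_i x_j + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \epsilon, \tag{8.12}$$

where the $\beta$'s are unknown parameters and $\epsilon$ is a random error with mean 
zero and variance $\sigma^2$. Model (8.12) can be written as 

$$y(\mathbf{x}) = \beta_0 + \mathbf{x}'\boldsymbol{\beta} + \mathbf{x}'\mathbf{B}\mathbf{x} + \epsilon, \tag{8.13}$$

where $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_k)'$ and $\mathbf{B}$ is a symmetric $k \times k$ matrix of the form 

$$\mathbf{B} = \begin{bmatrix}
\beta_{11} & \tfrac{1}{2}\beta_{12} & \tfrac{1}{2}\beta_{13} & \cdots & \tfrac{1}{2}\beta_{1k} \\
 & \beta_{22} & \tfrac{1}{2}\beta_{23} & \cdots & \tfrac{1}{2}\beta_{2k} \\
 & & \ddots & & \vdots \\
 & & & & \tfrac{1}{2}\beta_{k-1,k} \\
\text{symmetric} & & & & \beta_{kk}
\end{bmatrix}.$$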

Least-squares estimates of the parameters in model (8.13) can be obtained by 
using data collected according to a second-order design. A description of 
potential second-order designs can be found in Khuri and Cornell (1996, 
Chapter 4). 

Let $\hat\beta_0$, $\hat{\boldsymbol{\beta}}$, and $\hat{\mathbf{B}}$ denote the least-squares estimates of $\beta_0$, $\boldsymbol{\beta}$, and $\mathbf{B}$, 
respectively. The predicted response $\hat{y}(\mathbf{x})$ inside the region R is then given by 

$$\hat{y}(\mathbf{x}) = \hat\beta_0 + \mathbf{x}'\hat{\boldsymbol{\beta}} + \mathbf{x}'\hat{\mathbf{B}}\mathbf{x}. \tag{8.14}$$


The input variables are coded so that the design center coincides with the 
origin of the coordinates system. 

The method of ridge analysis is used to find the optimum (maximum or 
minimum) of $\hat{y}(\mathbf{x})$ on concentric hyperspheres of varying radii inside the 
region R. It is particularly useful in situations in which the unconstrained 
optimum of $\hat{y}(\mathbf{x})$ falls outside the region R, or if a saddle point occurs 
inside R. 

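Let us now proceed to optimize $\hat{y}(\mathbf{x})$ subject to the constraint 

$$\sum_{i=1}^{k} x_i^2 = r^2, \tag{8.15}$$

where r is the radius of a hypersphere centered at the origin and is contained 
inside the region R. Using the method of Lagrange multipliers, let us 
consider the function 

$$F = \hat{y}(\mathbf{x}) - \lambda\left(\sum_{i=1}^{k} x_i^2 - r^2\right), \tag{8.16}$$

where $\lambda$ is a Lagrange multiplier. Differentiating F with respect to $x_i$ 
($i = 1, 2, \ldots, k$) and equating the partial derivatives to zero, we obtain 

$$\frac{\partial F}{\partial x_1} = 2(\hat\beta_{11} - \lambda)x_1 + \hat\beta_{12}x_2 + \cdots + \hat\beta_{1k}x_k + \hat\beta_1 = 0,$$

$$\frac{\partial F}{\partial x_2} = \hat\beta_{12}x_1 + 2(\hat\beta_{22} - \lambda)x_2 + \cdots + \hat\beta_{2k}x_k + \hat\beta_2 = 0,$$

$$\vdots$$

$$\frac{\partial F}{\partial x_k} = \hat\beta_{1k}x_1 + \hat\beta_{2k}x_2 + \cdots + 2(\hat\beta_{kk} - \lambda)x_k + \hat\beta_k = 0.$$

These equations can be expressed as 

$$(\hat{\mathbf{B}} - \lambda\mathbf{I}_k)\mathbf{x} = -\tfrac{1}{2}\hat{\boldsymbol{\beta}}. \tag{8.17}$$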


Equations (8.15) and (8.17) need to be solved for $x_1, x_2, \ldots, x_k$ and $\lambda$. This 
traditional approach, however, requires calculations that are somewhat 
involved. Draper (1963) proposed the following simpler, yet equivalent 
procedure: 

i. Regard r as a variable, but fix $\lambda$ instead. 

ii. Insert the selected value of $\lambda$ in equation (8.17) and solve for $\mathbf{x}$. The 
solution is used in steps iii and iv. 

iii. Compute $r = (\mathbf{x}'\mathbf{x})^{1/2}$. 

iv. Evaluate $\hat{y}(\mathbf{x})$. 


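Several values of $\lambda$ can give rise to several stationary points which lie on 
the same hypersphere of radius r. This can be seen from the fact that if $\lambda$ is 
chosen to be different from any eigenvalue of $\hat{\mathbf{B}}$, then equation (8.17) has a 
unique solution given by 

$$\mathbf{x} = -\tfrac{1}{2}(\hat{\mathbf{B}} - \lambda\mathbf{I}_k)^{-1}\hat{\boldsymbol{\beta}}. \tag{8.18}$$

By substituting $\mathbf{x}$ in equation (8.15) we obtain 

$$\hat{\boldsymbol{\beta}}'(\hat{\mathbf{B}} - \lambda\mathbf{I}_k)^{-2}\hat{\boldsymbol{\beta}} = 4r^2. \tag{8.19}$$

Hence, each value of r gives rise to at most 2k corresponding values of $\lambda$. 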
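
Steps i through iv of Draper's procedure translate almost directly into code. The sketch below is illustrative only: the coefficients `b0`, `b`, and `B` are hypothetical, and the function name `ridge_point` is not from the text. The value of $\lambda$ is fixed above the largest eigenvalue of the coefficient matrix, an assumption made here so that the resulting point is a constrained maximum.

```python
import numpy as np

def ridge_point(b0, b, B, lam):
    """Steps ii-iv: solve (B - lam*I)x = -b/2, then return x, r, and y_hat(x)."""
    k = len(b)
    x = np.linalg.solve(B - lam * np.eye(k), -0.5 * b)   # equation (8.18)
    r = np.sqrt(x @ x)                                   # step iii
    y_hat = b0 + x @ b + x @ B @ x                       # step iv, formula (8.14)
    return x, r, y_hat

# Hypothetical fitted coefficients for k = 2 (not taken from the text).
b0 = 5.0
b = np.array([1.0, -2.0])
B = np.array([[-1.0, 0.5], [0.5, 2.0]])

lam = np.linalg.eigvalsh(B).max() + 1.0                  # step i: fix lambda
x, r, y_hat = ridge_point(b0, b, B, lam)

# Left-hand side of (8.19), which should equal 4 r^2.
A_inv = np.linalg.inv(B - lam * np.eye(2))
lhs_819 = b @ A_inv @ A_inv @ b
```

Repeating the call for a grid of $\lambda$ values traces out the pairs $(r, \hat{y})$ without ever solving (8.15) and (8.17) simultaneously.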

The choice of $\lambda$ has an effect on the nature of the stationary point. Some 
values of $\lambda$ produce points at each of which $\hat{y}$ has a maximum. Other values 
of $\lambda$ cause $\hat{y}$ to have minimum values. More specifically, suppose that $\lambda_1$ and 
$\lambda_2$ are two values substituted for $\lambda$ in equation (8.18). Let $\mathbf{x}_1, \mathbf{x}_2$ and $r_1, r_2$ be 
the corresponding values of $\mathbf{x}$ and r, respectively. The following results, 
which were established in Draper (1963), can be helpful in selecting the value 
of $\lambda$ that produces a particular type of stationary point: 


RESULT 1. If $r_1 = r_2$ and $\lambda_1 > \lambda_2$, then $\hat{y}_1 > \hat{y}_2$, where $\hat{y}_1$ and $\hat{y}_2$ are the 
values of $\hat{y}(\mathbf{x})$ at $\mathbf{x}_1$ and $\mathbf{x}_2$, respectively. 


This result means that for two stationary points that have the same 
distance from the origin, ŷ will be larger at the stationary point with the 
larger value of A. 


RESULT 2. Let $\mathbf{M}$ be the matrix of second-order partial derivatives of F 
in formula (8.16), that is, 

$$\mathbf{M} = 2(\hat{\mathbf{B}} - \lambda\mathbf{I}_k). \tag{8.20}$$

If $r_1 = r_2$ and if $\mathbf{M}$ is positive definite for $\mathbf{x}_1$ and is indefinite (that is, neither 
positive definite nor negative definite) for $\mathbf{x}_2$, then $\hat{y}_1 < \hat{y}_2$. 


RESULT 3. If $\lambda_1$ is larger than the largest eigenvalue of $\hat{\mathbf{B}}$, then the 
corresponding solution $\mathbf{x}_1$ in formula (8.18) is a point of absolute maximum 
for $\hat{y}$ on a hypersphere of radius $r_1 = (\mathbf{x}_1'\mathbf{x}_1)^{1/2}$. If, on the other hand, $\lambda_1$ is 
smaller than the smallest eigenvalue of $\hat{\mathbf{B}}$, then $\mathbf{x}_1$ is a point of absolute 
minimum for $\hat{y}$ on the same hypersphere. 


On the basis of Result 3 we can select several values of $\lambda$ that exceed the 
largest eigenvalue of $\hat{\mathbf{B}}$. The resulting values of the k elements of $\mathbf{x}$ and $\hat{y}$ can 
be plotted against the corresponding values of r. This produces k + 1 plots 
called ridge plots (see Myers, 1976, Section 5.3). They are useful in that an 
experimenter can determine, for a particular r, the maximum of $\hat{y}$ within a 
region R and the operating conditions (that is, the elements of $\mathbf{x}$) that give 
rise to the maximum. Similar plots can be obtained for the minimum of $\hat{y}$ 
(here, values of $\lambda$ that are smaller than the smallest eigenvalue of $\hat{\mathbf{B}}$ must be 
chosen). Obviously, the portions of the ridge plots that fall outside R should 
not be considered. 


EXAMPLE 8.3.1. An experiment was conducted to investigate the effects 
of three fertilizer ingredients on the yield of snap beans under field 
conditions. The fertilizer ingredients and actual amounts applied were nitrogen 
(N), from 0.94 to 6.29 lb/plot; phosphoric acid (P₂O₅), from 0.59 to 2.97 
lb/plot; and potash (K₂O), from 0.60 to 4.22 lb/plot. The response of 
interest is the average yield in pounds per plot of snap beans. 

Five levels of each fertilizer were used. The levels are coded using the 
following linear transformations: 

$$x_1 = \frac{X_1 - 3.62}{1.59}, \qquad x_2 = \frac{X_2 - 1.78}{0.71}, \qquad x_3 = \frac{X_3 - 2.42}{1.07}.$$

Here, $X_1$, $X_2$, and $X_3$ denote the actual levels of nitrogen, phosphoric acid, 
and potash, respectively, used in the experiment, and $x_1, x_2, x_3$ the 
corresponding coded values. In this particular coding scheme, 3.62, 1.78, and 2.42 
are the averages of the experimental levels of $X_1$, $X_2$, and $X_3$, respectively, 
that is, they represent the centers of the values of nitrogen, phosphoric acid, 
and potash, respectively. The denominators of $x_1$, $x_2$, and $x_3$ were chosen so 
that the second and fourth levels of each $X_i$ correspond to the values $-1$ 
and 1, respectively, for $x_i$ ($i = 1, 2, 3$). One advantage of such a coding 
scheme is to make the levels of the three fertilizers scale free (this is 
necessary in general, since the input variables can have different units of 
measurement). The measured and coded levels for the three fertilizers are 
shown below: 


                 Levels of x_i (i = 1, 2, 3) 

Fertilizer    -1.682   -1.000    0.000    1.000    1.682 
N               0.94     2.03     3.62     5.21     6.29 
P₂O₅            0.59     1.07     1.78     2.49     2.97 
K₂O             0.60     1.35     2.42     3.49     4.22 
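
The coding transformations and their inverses are simple enough to check numerically. In the sketch below the centers and denominators are those given above; the function names `to_coded` and `to_actual` are hypothetical.

```python
import numpy as np

CENTER = np.array([3.62, 1.78, 2.42])   # averages of the N, P2O5, K2O levels (lb/plot)
SCALE = np.array([1.59, 0.71, 1.07])    # denominators of the coding transformations

def to_coded(X):
    """Map actual fertilizer levels (lb/plot) to coded values x_i."""
    return (np.asarray(X, dtype=float) - CENTER) / SCALE

def to_actual(x):
    """Map coded values x_i back to actual fertilizer levels (lb/plot)."""
    return CENTER + SCALE * np.asarray(x, dtype=float)

low = to_actual([-1, -1, -1])    # second level of each fertilizer
high = to_actual([1, 1, 1])      # fourth level of each fertilizer
```

The coded values $-1$ and 1 map back to the second and fourth columns of the table of levels, as the choice of denominators requires.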


Combinations of the levels of the three fertilizers were applied according 
to the experimental design shown in Table 8.1, in which the design settings 
are given in terms of the coded levels. Six center-point replications were run 
in order to obtain an estimate of the experimental error variance. This 
particular design is called a central composite design [for a description of this 
design and its properties, see Khuri and Cornell (1996, Section 4.5.3)], which 
has the rotatability property. By this we mean that the prediction variance, 
that is, $\mathrm{Var}[\hat{y}(\mathbf{x})]$, is constant at all points that are equidistant from the design 
center [see Khuri and Cornell (1996, Section 2.8.3) for more detailed 
information concerning rotatability]. The corresponding response (yield) values 
are given in Table 8.1. 
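
Rotatability can be checked numerically for a design of this type. The sketch below builds a central composite design in three variables with axial distance $8^{1/4} \approx 1.682$, the standard rotatable choice for a $2^3$ factorial portion, and evaluates the scaled prediction variance $\mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x})$ at randomly chosen points on the unit sphere. The function names and the random points are illustrative assumptions.

```python
import numpy as np
from itertools import product

a = 8 ** 0.25                                   # axial distance, approximately 1.682
factorial = [list(p) for p in product([-1.0, 1.0], repeat=3)]
axial = [[-a, 0, 0], [a, 0, 0], [0, -a, 0], [0, a, 0], [0, 0, -a], [0, 0, a]]
center = [[0.0, 0.0, 0.0]] * 6                  # six center-point replications
design = np.array(factorial + axial + center, dtype=float)

def f_vec(x):
    """Second-order model terms: 1, x_i, cross products, squares."""
    x1, x2, x3 = x
    return np.array([1.0, x1, x2, x3, x1 * x2, x1 * x3, x2 * x3,
                     x1**2, x2**2, x3**2])

X = np.array([f_vec(x) for x in design])
XtX_inv = np.linalg.inv(X.T @ X)

def scaled_variance(x):
    """Var[y_hat(x)] / sigma^2 for the second-order model."""
    f = f_vec(x)
    return f @ XtX_inv @ f

rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)   # five points with radius 1
variances = np.array([scaled_variance(p) for p in points])
```

All five values agree to within rounding error, which is what the rotatability property asserts for points equidistant from the design center.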

A second-order model of the form given by formula (8.12) was fitted to the 
data set in Table 8.1. Thus in terms of the coded variables we have the model 


$$y(\mathbf{x}) = \beta_0 + \sum_{i=1}^{3} \beta_i x_i + \beta_{12}x_1x_2 + \beta_{13}x_1x_3 + \beta_{23}x_2x_3 + \sum_{i=1}^{3} \beta_{ii}x_i^2 + \epsilon. \tag{8.21}$$
Table 8.1. The Coded and Actual Settings of the Three Fertilizers 
and the Corresponding Response Values 

    x1       x2       x3        N     P₂O₅     K₂O    Yield y 
   -1       -1       -1       2.03    1.07    1.35     11.28 
    1       -1       -1       5.21    1.07    1.35      8.44 
   -1        1       -1       2.03    2.49    1.35     13.19 
    1        1       -1       5.21    2.49    1.35      7.71 
   -1       -1        1       2.03    1.07    3.49      8.94 
    1       -1        1       5.21    1.07    3.49     10.90 
   -1        1        1       2.03    2.49    3.49     11.85 
    1        1        1       5.21    2.49    3.49     11.03 
   -1.682    0        0       0.94    1.78    2.42      8.26 
    1.682    0        0       6.29    1.78    2.42      7.87 
    0       -1.682    0       3.62    0.59    2.42     12.08 
    0        1.682    0       3.62    2.97    2.42     11.06 
    0        0       -1.682   3.62    1.78    0.60      7.98 
    0        0        1.682   3.62    1.78    4.22     10.43 
    0        0        0       3.62    1.78    2.42     10.14 
    0        0        0       3.62    1.78    2.42     10.22 
    0        0        0       3.62    1.78    2.42     10.53 
    0        0        0       3.62    1.78    2.42      9.50 
    0        0        0       3.62    1.78    2.42     11.53 
    0        0        0       3.62    1.78    2.42     11.02 


Source: A. I. Khuri and J. A. Cornell (1996). Reproduced with permission of Marcel Dekker, Inc. 


The resulting prediction equation is given by 


$$\hat{y}(\mathbf{x}) = 10.462 - 0.574x_1 + 0.183x_2 + 0.456x_3 - 0.678x_1x_2 + 1.183x_1x_3 + 0.233x_2x_3 - 0.676x_1^2 + 0.563x_2^2 - 0.273x_3^2. \tag{8.22}$$

Here, $\hat{y}(\mathbf{x})$ is the predicted yield at the point $\mathbf{x} = (x_1, x_2, x_3)'$. Equation (8.22) 
can be expressed in matrix form as in equation (8.14), where $\hat{\boldsymbol{\beta}} = 
(-0.574, 0.183, 0.456)'$ and $\hat{\mathbf{B}}$ is the matrix 

$$\hat{\mathbf{B}} = \begin{bmatrix}
-0.676 & -0.339 & 0.592 \\
-0.339 & 0.563 & 0.117 \\
0.592 & 0.117 & -0.273
\end{bmatrix}. \tag{8.23}$$
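
As a check on equation (8.22), the second-order model can be refitted to the data of Table 8.1 by ordinary least squares. The sketch below uses the coded settings and yields from the table; the column ordering of the model matrix and the variable names are choices made here for illustration.

```python
import numpy as np

a = 1.682
coded = np.array(
    [[-1, -1, -1], [1, -1, -1], [-1, 1, -1], [1, 1, -1],
     [-1, -1, 1], [1, -1, 1], [-1, 1, 1], [1, 1, 1],
     [-a, 0, 0], [a, 0, 0], [0, -a, 0], [0, a, 0], [0, 0, -a], [0, 0, a]]
    + [[0, 0, 0]] * 6, dtype=float)
yield_ = np.array([11.28, 8.44, 13.19, 7.71, 8.94, 10.90, 11.85, 11.03,
                   8.26, 7.87, 12.08, 11.06, 7.98, 10.43,
                   10.14, 10.22, 10.53, 9.50, 11.53, 11.02])

def second_order_row(x):
    """Model terms in the order used in (8.22): 1, x_i, cross products, squares."""
    x1, x2, x3 = x
    return [1.0, x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1**2, x2**2, x3**2]

X = np.array([second_order_row(x) for x in coded])
coef, *_ = np.linalg.lstsq(X, yield_, rcond=None)

# Coefficients reported in equation (8.22), in the same order.
reported = np.array([10.462, -0.574, 0.183, 0.456, -0.678,
                     1.183, 0.233, -0.676, 0.563, -0.273])
```

The least-squares estimates in `coef` agree with the reported coefficients to about three decimal places.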
The coordinates of the stationary point $\mathbf{x}_0$ of $\hat{y}(\mathbf{x})$ satisfy the equations 

$$\frac{\partial \hat{y}}{\partial x_1} = -0.574 + 2(-0.676x_1 - 0.339x_2 + 0.592x_3) = 0,$$

$$\frac{\partial \hat{y}}{\partial x_2} = 0.183 + 2(-0.339x_1 + 0.563x_2 + 0.117x_3) = 0,$$

$$\frac{\partial \hat{y}}{\partial x_3} = 0.456 + 2(0.592x_1 + 0.117x_2 - 0.273x_3) = 0,$$

which can be expressed as 

$$\hat{\boldsymbol{\beta}} + 2\hat{\mathbf{B}}\mathbf{x}_0 = \mathbf{0}. \tag{8.24}$$

Hence, 

$$\mathbf{x}_0 = -\tfrac{1}{2}\hat{\mathbf{B}}^{-1}\hat{\boldsymbol{\beta}} = (-0.394, -0.364, -0.175)'.$$

The eigenvalues of $\hat{\mathbf{B}}$ are $\tau_1 = 0.6508$, $\tau_2 = 0.1298$, $\tau_3 = -1.1678$. The matrix 
$\hat{\mathbf{B}}$ is therefore neither positive definite nor negative definite, that is, $\mathbf{x}_0$ is a 
saddle point (see Corollary 7.7.1). This point falls inside the experimental 
region R, which, in the space of the coded variables $x_1, x_2, x_3$, is a sphere 
centered at the origin of radius $\sqrt{3}$. 
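
The stationary point and the eigenvalues can be recomputed from the rounded entries of (8.23). This is a sketch for checking purposes; because the coefficients are rounded to three decimals, the results should be expected to match the reported values only to about two or three decimal places.

```python
import numpy as np

b = np.array([-0.574, 0.183, 0.456])
B = np.array([[-0.676, -0.339, 0.592],
              [-0.339, 0.563, 0.117],
              [0.592, 0.117, -0.273]])

x0 = np.linalg.solve(B, -0.5 * b)                     # solves equation (8.24)
eigenvalues = np.sort(np.linalg.eigvalsh(B))[::-1]    # largest first
is_saddle = eigenvalues.max() > 0 > eigenvalues.min() # mixed signs: indefinite matrix
inside_region = np.linalg.norm(x0) < np.sqrt(3.0)
```

Eigenvalues of mixed sign confirm that the matrix is indefinite, so the stationary point is a saddle point rather than a maximum or a minimum.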

Let us now apply the method of ridge analysis to maximize $\hat{y}$ inside the 
region R. For this purpose we choose values of $\lambda$ [the Lagrange multiplier in 
equation (8.16)] larger than $\tau_1 = 0.6508$, the largest eigenvalue of $\hat{\mathbf{B}}$. For each 
such value of $\lambda$, equation (8.17) has a solution for $\mathbf{x}$ that represents a point of 
absolute maximum of $\hat{y}(\mathbf{x})$ on a sphere of radius $r = (\mathbf{x}'\mathbf{x})^{1/2}$ inside R. The 
results are displayed in Table 8.2. We note that at the point 
$(-0.558, 1.640, 0.087)'$, which is located near the periphery of the region R, 
the maximum value of $\hat{y}$ is 13.021. By expressing the coordinates of this point 
in terms of the actual values of the three fertilizers we obtain $X_1 = 2.733$ 
lb/plot, $X_2 = 2.944$ lb/plot, and $X_3 = 2.513$ lb/plot. We conclude that a 
combination of nitrogen, phosphoric acid, and potash fertilizers at the rates 
of 2.733, 2.944, and 2.513 lb/plot, respectively, results in an estimated 
maximum yield of snap beans of 13.021 lb/plot. 


Table 8.2. Ridge Analysis Values 

λ      1.906   1.166   0.979   0.889   0.840   0.808   0.784   0.770   0.754   0.745   0.740 
x1    -0.106  -0.170  -0.221  -0.269  -0.316  -0.362  -0.408  -0.453  -0.499  -0.544  -0.558 
x2     0.102   0.269   0.438   0.605   0.771   0.935   1.099   1.263   1.426   1.589   1.640 
x3     0.081   0.110   0.118   0.120   0.117   0.113   0.108   0.102   0.096   0.089   0.087 
r      0.168   0.337   0.505   0.673   0.841   1.009   1.177   1.346   1.514   1.682   1.734 
ŷ     10.575  10.693  10.841  11.024  11.243  11.499  11.790  12.119  12.484  12.886  13.021 
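
The entries of Table 8.2 can be regenerated from equation (8.17) using the rounded coefficients of (8.22) and (8.23). The sketch below is illustrative; the function name `ridge_point` is hypothetical, and since both the coefficients and the tabulated values of $\lambda$ are rounded, columns with $\lambda$ close to the largest eigenvalue should be expected to agree with the table only approximately.

```python
import numpy as np

b0 = 10.462
b = np.array([-0.574, 0.183, 0.456])
B = np.array([[-0.676, -0.339, 0.592],
              [-0.339, 0.563, 0.117],
              [0.592, 0.117, -0.273]])

def ridge_point(lam):
    """Solve (8.17) for a fixed lambda; return x, r = (x'x)^(1/2), and y_hat(x)."""
    x = np.linalg.solve(B - lam * np.eye(3), -0.5 * b)
    return x, np.sqrt(x @ x), b0 + x @ b + x @ B @ x

# A few lambda values from Table 8.2, all above the largest eigenvalue of B.
for lam in (1.906, 1.166, 0.979, 0.740):
    x, r, y_hat = ridge_point(lam)
    print(lam, np.round(x, 3), round(float(r), 3), round(float(y_hat), 3))

# Convert the reported maximizer to actual fertilizer rates (lb/plot).
center = np.array([3.62, 1.78, 2.42])
scale = np.array([1.59, 0.71, 1.07])
actual = center + scale * np.array([-0.558, 1.640, 0.087])
```

The conversion in the last lines reproduces the rates 2.733, 2.944, and 2.513 lb/plot quoted in the example.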


8.3.3. Modified Ridge Analysis 


Optimization of $\hat{y}(\mathbf{x})$ on a hypersphere S by the method of ridge analysis is 
justified provided that the prediction variance on S is relatively small. 
Furthermore, it is desirable that this variance remain constant on S. If not, 
then it is possible to obtain poor estimates of the optimum response, 
especially when the dispersion in the prediction variances on S is large. Thus 
the reliability of ridge analysis as an optimum-seeking procedure depends 
very much on controlling the size and variability of the prediction variance. If 
the design is rotatable, then the prediction variance, $\mathrm{Var}[\hat{y}(\mathbf{x})]$, is constant on 
S. It is then easy to attain small prediction variances by restricting the 
procedure to hyperspheres of small radii. However, if the design is not 
rotatable, then $\mathrm{Var}[\hat{y}(\mathbf{x})]$ may vary widely on S, which, as was mentioned 
earlier, can adversely affect the quality of estimation of the optimum 
response. This suggests that the prediction variance should be given serious 
consideration in the strategy of ridge analysis if the design used is not 
rotatable. 

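Khuri and Myers (1979) proposed a certain modification to the method of 
ridge analysis: one that optimizes $\hat{y}(\mathbf{x})$ subject to a particular constraint on 
the prediction variance. The following is a description of their proposed 
modification: 

Consider model (8.12), which can be written as 

$$y(\mathbf{x}) = \mathbf{f}'(\mathbf{x})\boldsymbol{\gamma} + \epsilon, \tag{8.25}$$

where 

$$\mathbf{f}(\mathbf{x}) = (1, x_1, x_2, \ldots, x_k, x_1x_2, x_1x_3, \ldots, x_{k-1}x_k, x_1^2, x_2^2, \ldots, x_k^2)',$$

$$\boldsymbol{\gamma} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_k, \beta_{12}, \beta_{13}, \ldots, \beta_{k-1,k}, \beta_{11}, \beta_{22}, \ldots, \beta_{kk})'.$$

The predicted response is given by 

$$\hat{y}(\mathbf{x}) = \mathbf{f}'(\mathbf{x})\hat{\boldsymbol{\gamma}}, \tag{8.26}$$

where $\hat{\boldsymbol{\gamma}}$ is the least-squares estimator of $\boldsymbol{\gamma}$, namely, 

$$\hat{\boldsymbol{\gamma}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \tag{8.27}$$

where $\mathbf{X} = [\mathbf{f}(\mathbf{x}_1), \mathbf{f}(\mathbf{x}_2), \ldots, \mathbf{f}(\mathbf{x}_n)]'$ with $\mathbf{x}_i$ being the vector of design settings 
at the ith experimental run ($i = 1, 2, \ldots, n$, where n is the number of runs 
used in the experiment), and $\mathbf{y}$ is the corresponding vector of n observations. 
Since 

$$\mathrm{Var}(\hat{\boldsymbol{\gamma}}) = (\mathbf{X}'\mathbf{X})^{-1}\sigma^2, \tag{8.28}$$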
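where $\sigma^2$ is the error variance, then from equation (8.26), the prediction 
variance is of the form 

$$\mathrm{Var}[\hat{y}(\mathbf{x})] = \sigma^2\mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x}). \tag{8.29}$$

The number of unknown parameters in model (8.25) is $p = (k + 1)(k + 2)/2$, 
where k is the number of input variables. Let $\nu_1, \nu_2, \ldots, \nu_p$ denote the 
eigenvalues of $\mathbf{X}'\mathbf{X}$. Then from equation (8.29) and Theorem 2.3.16 we have 

$$\frac{\sigma^2\mathbf{f}'(\mathbf{x})\mathbf{f}(\mathbf{x})}{\nu_{\max}} \le \mathrm{Var}[\hat{y}(\mathbf{x})] \le \frac{\sigma^2\mathbf{f}'(\mathbf{x})\mathbf{f}(\mathbf{x})}{\nu_{\min}},$$

where $\nu_{\min}$ and $\nu_{\max}$ are, respectively, the smallest and largest of the $\nu_i$'s. 
This double inequality shows that the prediction variance can be inflated if 
$\mathbf{X}'\mathbf{X}$ has small eigenvalues. This occurs when the columns of $\mathbf{X}$ are 
multicollinear (see, for example, Myers, 1990, pages 125-126 and Chapter 8). 
Now, by the spectral decomposition theorem (Theorem 2.3.10), $\mathbf{X}'\mathbf{X} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}'$, 
where $\mathbf{V}$ is an orthogonal matrix of orthonormal eigenvectors of $\mathbf{X}'\mathbf{X}$ and 
$\boldsymbol{\Lambda} = \mathrm{Diag}(\nu_1, \nu_2, \ldots, \nu_p)$ is a diagonal matrix of eigenvalues of $\mathbf{X}'\mathbf{X}$. Equation 
(8.29) can then be written as 

$$\mathrm{Var}[\hat{y}(\mathbf{x})] = \sigma^2\sum_{j=1}^{p} \frac{[\mathbf{f}'(\mathbf{x})\mathbf{v}_j]^2}{\nu_j}, \tag{8.30}$$

where $\mathbf{v}_j$ is the jth column of $\mathbf{V}$ ($j = 1, 2, \ldots, p$). If we denote the elements of 
$\mathbf{v}_j$ by $v_{0j}, v_{1j}, \ldots, v_{kj}, v_{12j}, v_{13j}, \ldots, v_{k-1,kj}, v_{11j}, v_{22j}, \ldots, v_{kkj}$, then $\mathbf{f}'(\mathbf{x})\mathbf{v}_j$ can 
be expressed as 

$$\mathbf{f}'(\mathbf{x})\mathbf{v}_j = v_{0j} + \mathbf{x}'\boldsymbol{\tau}_j + \mathbf{x}'\mathbf{T}_j\mathbf{x}, \qquad j = 1, 2, \ldots, p, \tag{8.31}$$

where $\boldsymbol{\tau}_j = (v_{1j}, v_{2j}, \ldots, v_{kj})'$ and 

$$\mathbf{T}_j = \begin{bmatrix}
v_{11j} & \tfrac{1}{2}v_{12j} & \tfrac{1}{2}v_{13j} & \cdots & \tfrac{1}{2}v_{1kj} \\
 & v_{22j} & \tfrac{1}{2}v_{23j} & \cdots & \tfrac{1}{2}v_{2kj} \\
 & & \ddots & & \vdots \\
 & & & & \tfrac{1}{2}v_{k-1,kj} \\
\text{symmetric} & & & & v_{kkj}
\end{bmatrix}, \qquad j = 1, 2, \ldots, p.$$

We note that the form of $\mathbf{f}'(\mathbf{x})\mathbf{v}_j$, as given by formula (8.31), is identical to 
that of a second-order model. Formula (8.30) can then be written as 

$$\mathrm{Var}[\hat{y}(\mathbf{x})] = \sigma^2\sum_{j=1}^{p} \frac{(v_{0j} + \mathbf{x}'\boldsymbol{\tau}_j + \mathbf{x}'\mathbf{T}_j\mathbf{x})^2}{\nu_j}. \tag{8.32}$$

As was noted earlier, small values of $\nu_j$ ($j = 1, 2, \ldots, p$) cause $\hat{y}(\mathbf{x})$ to have 
large variances. 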
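
The equivalence of formulas (8.29) and (8.30), and the double inequality that precedes them, can be verified numerically. The sketch below assumes a hypothetical $3^2$ factorial design with k = 2, so that p = 6; the function names are illustrative, and the variance is reported in units of $\sigma^2$.

```python
import numpy as np

def f_vec(x):
    """Second-order model vector f(x) for k = 2: (1, x1, x2, x1*x2, x1^2, x2^2)'."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

# Hypothetical 3^2 factorial design in coded units.
levels = [-1.0, 0.0, 1.0]
design = np.array([[u, v] for u in levels for v in levels])
X = np.array([f_vec(x) for x in design])
XtX = X.T @ X
nu, V = np.linalg.eigh(XtX)        # eigenvalues nu_j and orthonormal eigenvectors v_j

def scaled_variance_direct(x):
    """Formula (8.29) divided by sigma^2."""
    f = f_vec(x)
    return f @ np.linalg.solve(XtX, f)

def scaled_variance_spectral(x):
    """Formula (8.30) divided by sigma^2: sum of [f'(x) v_j]^2 / nu_j."""
    f = f_vec(x)
    return np.sum((f @ V) ** 2 / nu)

x = np.array([0.5, -0.3])
f = f_vec(x)
lower, upper = (f @ f) / nu.max(), (f @ f) / nu.min()
```

The term with the smallest $\nu_j$ carries the weight $1/\nu_{\min}$, which is why a nearly singular $\mathbf{X}'\mathbf{X}$ inflates the prediction variance.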

To reduce the size of the prediction variance within the region explored by 
ridge analysis, we can consider putting constraints on the portion of $\mathrm{Var}[\hat{y}(\mathbf{x})]$ 
that corresponds to $\nu_{\min}$. It makes sense to optimize $\hat{y}(\mathbf{x})$ subject to the 
constraints 

$$\mathbf{x}'\mathbf{x} = r^2, \tag{8.33}$$

$$|v_{0m} + \mathbf{x}'\boldsymbol{\tau}_m + \mathbf{x}'\mathbf{T}_m\mathbf{x}| \le q, \tag{8.34}$$

where $v_{0m}$, $\boldsymbol{\tau}_m$, and $\mathbf{T}_m$ are the values of $v_0$, $\boldsymbol{\tau}$, and $\mathbf{T}$ that correspond to $\nu_{\min}$. 
Here, q is a positive constant chosen small enough to offset the small value 
of $\nu_{\min}$. Khuri and Myers (1979) suggested that q be equal to the largest 
value taken by $|v_{0m} + \mathbf{x}'\boldsymbol{\tau}_m + \mathbf{x}'\mathbf{T}_m\mathbf{x}|$ at the n design points. The rationale 
behind this rule of thumb is that the prediction variance is smaller at the 
design points than at other points in the experimental region. 

The modification suggested by Khuri and Myers (1979) amounts to adding 
the constraint (8.34) to the usual procedure of ridge analysis. In this way, 
some control can be maintained on the size of prediction variance during the 
optimization process. The mathematical algorithm needed for this con- 
strained optimization is based on a technique introduced by Myers and 
Carter (1973) for a dual response system in which a primary second-order 
response function is optimized subject to the condition that a constrained 
second-order response function takes on some specified or desirable values. 
Here, the primary response is $\hat{y}(\mathbf{x})$ and the constrained response is $v_{0m} + 
\mathbf{x}'\boldsymbol{\tau}_m + \mathbf{x}'\mathbf{T}_m\mathbf{x}$. 

Myers and Carter's (1973) procedure is based on the method of Lagrange 
multipliers, which uses the function 

$$L = \hat\beta_0 + \mathbf{x}'\hat{\boldsymbol{\beta}} + \mathbf{x}'\hat{\mathbf{B}}\mathbf{x} - \mu(v_{0m} + \mathbf{x}'\boldsymbol{\tau}_m + \mathbf{x}'\mathbf{T}_m\mathbf{x} - \omega) - \lambda(\mathbf{x}'\mathbf{x} - r^2),$$

where $\hat\beta_0$, $\hat{\boldsymbol{\beta}}$, and $\hat{\mathbf{B}}$ are the same as in model (8.14), $\mu$ and $\lambda$ are Lagrange 
multipliers, $\omega$ is such that $|\omega| \le q$ [see inequality (8.34)], and r is the radius 
of a hypersphere centered at the origin and contained inside a region of 
interest R. By differentiating L with respect to $x_1, x_2, \ldots, x_k$ and equating 
the derivatives to zero, we obtain 

$$(\hat{\mathbf{B}} - \mu\mathbf{T}_m - \lambda\mathbf{I}_k)\mathbf{x} = \tfrac{1}{2}(\mu\boldsymbol{\tau}_m - \hat{\boldsymbol{\beta}}). \tag{8.35}$$

As in the method of ridge analysis, to solve equation (8.35), values of $\mu$ and 
$\lambda$ are chosen directly in such a way that the solution represents a point of 
maximum (or minimum) for $\hat{y}(\mathbf{x})$. Thus for a given value of $\mu$, the matrix of 
second-order partial derivatives of L, namely $2(\hat{\mathbf{B}} - \mu\mathbf{T}_m - \lambda\mathbf{I}_k)$, is made 
negative definite [and hence a maximum of $\hat{y}(\mathbf{x})$ is achieved] by selecting $\lambda$ 
larger than the largest eigenvalue of $\hat{\mathbf{B}} - \mu\mathbf{T}_m$. Values of $\lambda$ smaller than the 
smallest eigenvalue of $\hat{\mathbf{B}} - \mu\mathbf{T}_m$ should be considered in order for $\hat{y}(\mathbf{x})$ to 
attain a minimum. It follows that for such an assignment of values for $\mu$ and 
$\lambda$, the corresponding solution of equation (8.35) produces an optimum for $\hat{y}$ 
subject to a fixed $r = (\mathbf{x}'\mathbf{x})^{1/2}$ and a fixed value of $v_{0m} + \mathbf{x}'\boldsymbol{\tau}_m + \mathbf{x}'\mathbf{T}_m\mathbf{x}$. 
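
Equation (8.35) is a linear system once $\mu$ and $\lambda$ are fixed, so a single solve gives the candidate point. The sketch below is a minimal illustration in which every numerical input (`b`, `B`, `tau_m`, `T_m`, `v0m`, `mu`) is hypothetical, and the function name `modified_ridge_point` is not from the text.

```python
import numpy as np

def modified_ridge_point(b, B, tau_m, T_m, mu, lam):
    """Solve (B - mu*T_m - lam*I)x = (mu*tau_m - b)/2, equation (8.35)."""
    k = len(b)
    A = B - mu * T_m - lam * np.eye(k)
    x = np.linalg.solve(A, 0.5 * (mu * tau_m - b))
    return x, A

# Hypothetical inputs for k = 3 (not taken from the text).
b = np.array([1.0, 2.0, -0.5])
B = np.array([[-1.0, 0.2, 0.0], [0.2, -0.5, 0.1], [0.0, 0.1, 0.3]])
v0m = -0.3
tau_m = np.array([0.05, 0.4, 0.4])
T_m = np.array([[0.1, 0.0, 0.3], [0.0, -0.1, 0.0], [0.3, 0.0, 0.6]])

mu = 2.0
lam = np.linalg.eigvalsh(B - mu * T_m).max() + 0.5   # above the largest eigenvalue
x, A = modified_ridge_point(b, B, tau_m, T_m, mu, lam)

r = np.sqrt(x @ x)                                   # radius attained
omega = v0m + x @ tau_m + x @ T_m @ x                # value of the constrained response
grad_L = b + 2 * B @ x - mu * (tau_m + 2 * T_m @ x) - 2 * lam * x
```

In practice one would scan a grid of $(\mu, \lambda)$ pairs and retain only the solutions whose `omega` satisfies $|\omega| \le q$ and whose `r` stays within the region of interest.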


EXAMPLE 8.3.2. An attempt was made to design an experiment from 
which one could find conditions on concentration of three basic substances 
that maximize a certain mechanical modular property of a solid propellant. 
The initial intent was to construct and use a central composite design (see 
Khuri and Cornell, 1996, Section 4.5.3) for the three components in the 
system. However, certain experimental difficulties prohibited the use of the 
design as planned, and the design used led to problems with multicollinearity 
as far as the fitting of the second-order model is concerned. The design 
settings and corresponding response values are given in Table 8.3. 

In this example, the smallest eigenvalue of $\mathbf{X}'\mathbf{X}$ is $\nu_{\min} = 0.0321$. 
Correspondingly, the values of $v_{0m}$, $\boldsymbol{\tau}_m$, and $\mathbf{T}_m$ in inequality (8.34) are $v_{0m} = 
-0.2935$, $\boldsymbol{\tau}_m = (0.0469, 0.4081, 0.4071)'$, and 

$$\mathbf{T}_m = \begin{bmatrix}
0.1129 & 0.0095 & 0.2709 \\
0.0095 & -0.1382 & -0.0148 \\
0.2709 & -0.0148 & 0.6453
\end{bmatrix}.$$

As for q in inequality (8.34), values of $|v_{0m} + \mathbf{x}'\boldsymbol{\tau}_m + \mathbf{x}'\mathbf{T}_m\mathbf{x}|$ were computed 
at each of the 15 design points in Table 8.3. The largest value was found to 
be 0.087. Hence, the value of $|v_{0m} + \mathbf{x}'\boldsymbol{\tau}_m + \mathbf{x}'\mathbf{T}_m\mathbf{x}|$ should not grow much 
larger than 0.09 in the experimental region. Furthermore, r in equation 
(8.33) must not exceed the value 2, since most of the design points are 
contained inside a sphere of radius 2. The results of maximizing $\hat{y}(\mathbf{x})$ subject 
to this dual constraint are given in Table 8.4. For the sake of comparison, the 
results of applying the standard procedure of ridge analysis (that is, without 
the additional constraint concerning $v_{0m} + \mathbf{x}'\boldsymbol{\tau}_m + \mathbf{x}'\mathbf{T}_m\mathbf{x}$) are displayed in 
Table 8.5. 


Table 8.3. Design Settings and Response Values for Example 8.3.2 

    x1        x2        x3          y 
  -1.020    -1.402    -0.998    13.5977 
   0.900     0.478    -0.818    12.7838 
   0.870    -1.282     0.882    16.2780 
  -0.950     0.458     0.972    14.1678 
  -0.930    -1.242    -0.868     9.2461 
   0.750     0.498    -0.618    17.0167 
   0.830    -1.092     0.732    13.4253 
  -0.950     0.378     0.832    16.0967 
   1.950    -0.462     0.002    14.5438 
  -2.150    -0.402    -0.038    20.9534 
  -0.550     0.058    -0.518    11.0411 
  -0.450     1.378     0.182    21.2088 
   0.150     1.208     0.082    25.5514 
   0.100     1.768    -0.008    33.3793 
   1.450    -0.342     0.182    15.4341 

Source: Khuri and Myers (1979). Reproduced with permission of the American 
Statistical Association. 


Table 8.4. Results of Modified Ridge Analysis 

r               0.848   1.162   1.530   1.623   1.795   1.850   1.904   1.935   2.000 
|ω|             0.006   0.074   0.136   1.139   0.048   0.086   0.126   0.146   0.165 
Var[ŷ(x)]/σ²    1.170   1.635   2.922   3.147   1.305   2.330   3.510   4.177   5.336 
x1              0.410   0.563   0.773   0.785   0.405   0.601   0.750   0.820   0.965 
x2              0.737   1.015   1.320   1.422   1.752   1.751   1.750   1.752   1.752 
x3              0.097   0.063   0.019   0.011   0.000   0.000   0.012   0.015  -0.030 
ŷ(x)           22.420  27.780  35.242  37.190  37.042  40.222  42.830  44.110  46.260 

Source: Khuri and Myers (1979). Reproduced with permission of the American Statistical 
Association. 


Table 8.5. Results of Standard Ridge Analysis 

r               0.140   0.379   0.698   0.938   1.146   1.394   1.484   1.744    1.944    1.975    2.000 
|ω|             0.241   0.124   0.104   0.337   0.587   0.942   1.085   1.553    1.958    2.025    2.080 
ω²/ν_min        1.804   0.477   0.337   3.543  10.718  27.631  36.641  75.163  119.371  127.815  134.735 
Var[ŷ(x)]/σ²    2.592   1.554   2.104   6.138  14.305  32.787  42.475  83.38   129.834  138.668  145.907 
x1              0.037   0.152   0.352   0.515   0.660   0.835   0.899   1.085    1.227    1.249    1.265 
x2              0.103   0.255   0.422   0.531   0.618   0.716   0.749   0.845    0.916    0.927    0.936 
x3              0.087   0.235   0.431   0.577   0.705   0.858   0.912   1.074    1.197    1.217    1.232 
ŷ(x)           12.796  16.021  21.365  26.229  31.086  37.640  40.197  48.332   55.147   56.272   57.176 

Source: Khuri and Myers (1979). Reproduced with permission of the American Statistical 
Association. 

It is clear from Tables 8.4 and 8.5 that the extra constraint concerning 
$v_{0m} + \mathbf{x}'\boldsymbol{\tau}_m + \mathbf{x}'\mathbf{T}_m\mathbf{x}$ has profoundly improved the precision of $\hat{y}$ at the 
estimated maxima. At a specified radius, the value of $\hat{y}$ obtained under standard 
ridge analysis is higher than the one obtained under modified ridge analysis. 
However, the prediction variance values under the latter procedure are much 
smaller, as can be seen from comparing Tables 8.4 and 8.5. While the 
tradeoff that exists between a high response value and a small prediction 
variance is a bit difficult to cope with from a decision making standpoint, 
there is a clear superiority of the results displayed in Table 8.4. For example, 
one would hardly choose any operating conditions in Table 8.5 that indicate 
$\hat{y} \ge 50$, due to the accompanying large prediction variances. On the other 
hand, Table 8.4 reveals that at radius r = 2.000, $\hat{y} = 46.26$ with $\mathrm{Var}[\hat{y}(\mathbf{x})]/\sigma^2 
= 5.336$, while a rival set of coordinates at r = 1.744 for standard ridge 
analysis gives $\hat{y} = 48.332$ with $\mathrm{Var}[\hat{y}(\mathbf{x})]/\sigma^2 = 83.38$. 

Row 3 of Table 8.5 gives values of $\omega^2/\nu_{\min}$, which should be compared 
with the corresponding values in row 4 of the same table. One can easily see 
that in this example, $\omega^2/\nu_{\min}$ accounts for a large portion of $\mathrm{Var}[\hat{y}(\mathbf{x})]/\sigma^2$. 


8.4. RESPONSE SURFACE DESIGNS 


We recall from Section 8.3 that one of the objectives of response surface 
methodology is the selection of a response surface design according to a 
certain optimality criterion. The design selection entails the specification of 
the settings of a group of input variables that can be used as experimental 
runs in a given experiment. 

The proper choice of a response surface design can have a profound effect 
on the success of a response surface exploration. To see this, let us suppose 
that the fitted model is linear of the form 


$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \tag{8.36}$$

where $\mathbf{y}$ is an $n \times 1$ vector of observations, $\mathbf{X}$ is an $n \times p$ known matrix that 
depends on the design settings, $\boldsymbol{\beta}$ is a vector of p unknown parameters, and 
$\boldsymbol{\epsilon}$ is a vector of random errors in the elements of $\mathbf{y}$. Typically, $\boldsymbol{\epsilon}$ is assumed to 
have the normal distribution $N(\mathbf{0}, \sigma^2\mathbf{I}_n)$, where $\sigma^2$ is unknown. In this case, 
the vector $\boldsymbol{\beta}$ is estimated by the least-squares estimator $\hat{\boldsymbol{\beta}}$, which is given by 

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{8.37}$$

If $x_1, x_2, \ldots, x_k$ are the input variables for the model under consideration, 
then the predicted response at a point $\mathbf{x} = (x_1, x_2, \ldots, x_k)'$ in a region of 
interest R is written as 

$$\hat{y}(\mathbf{x}) = \mathbf{f}'(\mathbf{x})\hat{\boldsymbol{\beta}}, \tag{8.38}$$

where $\mathbf{f}(\mathbf{x})$ is a $p \times 1$ vector whose first element is equal to one and whose 
remaining $p - 1$ elements are functions of $x_1, x_2, \ldots, x_k$. These functions are 
in the form of powers and cross products of powers of the $x_i$'s up to degree d. 
In this case, the model is said to be of order d. At the uth experimental run, 
$\mathbf{x}_u = (x_{u1}, x_{u2}, \ldots, x_{uk})'$ and the corresponding response value is $y_u$ ($u = 
1, 2, \ldots, n$). The $n \times k$ matrix $\mathbf{D} = [\mathbf{x}_1 : \mathbf{x}_2 : \cdots : \mathbf{x}_n]'$ is the design matrix. Thus 
by a choice of design we mean the specification of the elements of $\mathbf{D}$. 

If model (8.36) is correct, then $\hat{\boldsymbol{\beta}}$ is an unbiased estimator of $\boldsymbol{\beta}$ and its 
variance-covariance matrix is given by 

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1}\sigma^2. \tag{8.39}$$

Hence, from formula (8.38), the prediction variance can be written as 

$$\mathrm{Var}[\hat{y}(\mathbf{x})] = \sigma^2\mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x}). \tag{8.40}$$

The design $\mathbf{D}$ is rotatable if $\mathrm{Var}[\hat{y}(\mathbf{x})]$ remains constant at all points that are 
equidistant from the design center, as we may recall from Section 8.3. The 
input variables are coded so that the center of the design coincides with the 
origin of the coordinates system (see Khuri and Cornell, 1996, Section 2.8). 


8.4.1. First-Order Designs 


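If model (8.36) is of the first order (that is, d = 1), then the matrix $\mathbf{X}$ is of the 
form $\mathbf{X} = [\mathbf{1}_n : \mathbf{D}]$, where $\mathbf{1}_n$ is a vector of ones of order $n \times 1$ [see model 
(8.8)]. The input variables can be coded in such a way that the sum of the 
elements in each column of $\mathbf{D}$ is equal to zero. Consequently, the prediction 
variance in formula (8.40) can be written as 

$$\mathrm{Var}[\hat{y}(\mathbf{x})] = \sigma^2\left[\frac{1}{n} + \mathbf{x}'(\mathbf{D}'\mathbf{D})^{-1}\mathbf{x}\right]. \tag{8.41}$$

Formula (8.41) clearly shows the dependence of the prediction variance on 
the design matrix. 

A reasonable criterion for the choice of $\mathbf{D}$ is the minimization of $\mathrm{Var}[\hat{y}(\mathbf{x})]$, 
or equivalently, the minimization of $\mathbf{x}'(\mathbf{D}'\mathbf{D})^{-1}\mathbf{x}$ within the region R. To 
accomplish this we first note that for any $\mathbf{x}$ in the region R, 

$$\mathbf{x}'(\mathbf{D}'\mathbf{D})^{-1}\mathbf{x} \le \|\mathbf{x}\|_2^2\,\|(\mathbf{D}'\mathbf{D})^{-1}\|_2, \tag{8.42}$$

where $\|\mathbf{x}\|_2 = (\mathbf{x}'\mathbf{x})^{1/2}$ and $\|(\mathbf{D}'\mathbf{D})^{-1}\|_2 = \left[\sum_{i=1}^{k}\sum_{j=1}^{k} (d^{ij})^2\right]^{1/2}$ is the Euclidean 
norm of $(\mathbf{D}'\mathbf{D})^{-1}$ with $d^{ij}$ being its (i, j)th element ($i, j = 1, 2, \ldots, k$). 
Inequality (8.42) follows from applying Theorems 2.3.16 and 2.3.20. Thus by 
choosing the design $\mathbf{D}$ so that it minimizes $\|(\mathbf{D}'\mathbf{D})^{-1}\|_2$, the quantity 
$\mathbf{x}'(\mathbf{D}'\mathbf{D})^{-1}\mathbf{x}$, and hence the prediction variance, can be reduced throughout 
the region R. 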


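Theorem 8.4.1. For a given number $n$ of experimental runs, $\|(\mathbf{D}'\mathbf{D})^{-1}\|_2$
attains its minimum if the columns $\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_k$ of $\mathbf{D}$ are such that $\mathbf{d}_i'\mathbf{d}_j = 0$,
$i \ne j$, and $\mathbf{d}_i'\mathbf{d}_i$ is as large as possible inside the region $R$.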


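Proof. We have that $\mathbf{D} = [\mathbf{d}_1 : \mathbf{d}_2 : \cdots : \mathbf{d}_k]$. The elements of $\mathbf{d}_i$ are the $n$
design settings of the input variable $x_i$ ($i = 1, 2, \ldots, k$). Suppose that the
region $R$ places the following restrictions on these settings:

$$\mathbf{d}_i'\mathbf{d}_i \le c_i^2, \qquad i = 1, 2, \ldots, k, \tag{8.43}$$

where $c_i$ is some fixed constant. This means that the spread of the design in
the direction of the $i$th coordinate axis is bounded by $c_i^2$ ($i = 1, 2, \ldots, k$).

Now, if $d_{ii}$ denotes the $i$th diagonal element of $\mathbf{D}'\mathbf{D}$, then $d_{ii} = \mathbf{d}_i'\mathbf{d}_i$
($i = 1, 2, \ldots, k$). Furthermore,

$$d^{ii} \ge \frac{1}{d_{ii}}, \qquad i = 1, 2, \ldots, k, \tag{8.44}$$

where $d^{ii}$ is the $i$th diagonal element of $(\mathbf{D}'\mathbf{D})^{-1}$.

To prove inequality (8.44), let $\mathbf{D}_i$ be a matrix of order $n \times (k - 1)$ obtained
from $\mathbf{D}$ by removing its $i$th column $\mathbf{d}_i$ ($i = 1, 2, \ldots, k$). The cofactor of $d_{ii}$ in
$\mathbf{D}'\mathbf{D}$ is then $\det(\mathbf{D}_i'\mathbf{D}_i)$. Hence, from Section 2.3.3,

$$d^{ii} = \frac{\det(\mathbf{D}_i'\mathbf{D}_i)}{\det(\mathbf{D}'\mathbf{D})}, \qquad i = 1, 2, \ldots, k. \tag{8.45}$$

There exists an orthogonal matrix $\mathbf{E}_i$ of order $k \times k$ (whose determinant has
an absolute value of one) such that the first column of $\mathbf{D}\mathbf{E}_i$ is $\mathbf{d}_i$ and the
remaining columns are the same as those of $\mathbf{D}_i$, that is,

$$\mathbf{D}\mathbf{E}_i = [\mathbf{d}_i : \mathbf{D}_i], \qquad i = 1, 2, \ldots, k.$$

It follows that [see property 7 in Section 2.3.3]

$$\det(\mathbf{D}'\mathbf{D}) = \det(\mathbf{E}_i'\mathbf{D}'\mathbf{D}\mathbf{E}_i) = \det(\mathbf{D}_i'\mathbf{D}_i)\left[\mathbf{d}_i'\mathbf{d}_i - \mathbf{d}_i'\mathbf{D}_i(\mathbf{D}_i'\mathbf{D}_i)^{-1}\mathbf{D}_i'\mathbf{d}_i\right].$$

Hence, from (8.45) we obtain

$$d^{ii} = \left[d_{ii} - \mathbf{d}_i'\mathbf{D}_i(\mathbf{D}_i'\mathbf{D}_i)^{-1}\mathbf{D}_i'\mathbf{d}_i\right]^{-1}, \qquad i = 1, 2, \ldots, k. \tag{8.46}$$

Inequality (8.44) now follows from formula (8.46), since $\mathbf{d}_i'\mathbf{D}_i(\mathbf{D}_i'\mathbf{D}_i)^{-1}\mathbf{D}_i'\mathbf{d}_i \ge 0$.
We can therefore write

$$\|(\mathbf{D}'\mathbf{D})^{-1}\|_2 \ge \left[\sum_{i=1}^{k}(d^{ii})^2\right]^{1/2} \ge \left[\sum_{i=1}^{k}\frac{1}{d_{ii}^2}\right]^{1/2}.$$

Using the restrictions (8.43) we then have

$$\|(\mathbf{D}'\mathbf{D})^{-1}\|_2 \ge \left[\sum_{i=1}^{k}\frac{1}{c_i^4}\right]^{1/2}.$$

Equality is achieved if the columns of $\mathbf{D}$ are orthogonal to one another and
$d_{ii} = c_i^2$ ($i = 1, 2, \ldots, k$). This follows from the fact that $d^{ii} = 1/d_{ii}$ if and only
if $\mathbf{d}_i'\mathbf{D}_i = \mathbf{0}'$ ($i = 1, 2, \ldots, k$), as can be seen from formula (8.46). □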


Definition 8.4.1. A design for fitting a first-order model is said to be
orthogonal if its columns are orthogonal to one another. □


Corollary 8.4.1. For a given number $n$ of experimental runs, $\mathrm{Var}(\hat{\beta}_i)$
attains a minimum if and only if the design is orthogonal, where $\hat{\beta}_i$ is the
least-squares estimator of $\beta_i$ in model (8.8), $i = 1, 2, \ldots, k$.


Proof. This follows directly from Theorem 8.4.1 and the fact that $\mathrm{Var}(\hat{\beta}_i)
= \sigma^2 d^{ii}$ ($i = 1, 2, \ldots, k$), as can be seen from formula (8.39). □


From Theorem 8.4.1 and Corollary 8.4.1 we conclude that an orthogonal
design for fitting a first-order model has optimal variance properties.
Another advantage of orthogonal first-order designs is that the effects of the $k$
input variables in model (8.8), as measured by the values of the $\beta_i$'s
($i = 1, 2, \ldots, k$), can be estimated independently. This is because the off-diagonal
elements of the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ in formula (8.39) are zero.
This means that the elements of $\hat{\boldsymbol{\beta}}$ are uncorrelated and hence statistically
independent under the assumption of normality of the random error vector $\boldsymbol{\epsilon}$
in model (8.36).
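
As a numerical illustration of Theorem 8.4.1, the following Python sketch compares the Euclidean norm of $(\mathbf{D}'\mathbf{D})^{-1}$ for two four-run designs in $k = 2$ input variables. Both designs have zero column sums and $\mathbf{d}_i'\mathbf{d}_i = 4$, but only the first (a $2^2$ factorial) is orthogonal. The two designs, the helper names, and the choice $c_i^2 = 4$ are assumptions made only for this illustration.

```python
import math

def gram(design):
    """Return D'D for a two-variable design given as a list of (x1, x2) runs."""
    s11 = sum(x1 * x1 for x1, _ in design)
    s12 = sum(x1 * x2 for x1, x2 in design)
    s22 = sum(x2 * x2 for _, x2 in design)
    return [[s11, s12], [s12, s22]]

def inverse_2x2(m):
    """Invert a nonsingular 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def euclidean_norm(m):
    """Square root of the sum of squared elements, as in (8.42)."""
    return math.sqrt(sum(v * v for row in m for v in row))

def scaled_prediction_variance(design, x):
    """Var[y_hat(x)] / sigma^2 from formula (8.41)."""
    inv = inverse_2x2(gram(design))
    quad = sum(x[i] * inv[i][j] * x[j] for i in range(2) for j in range(2))
    return 1.0 / len(design) + quad

# Both designs have zero column sums and d_i'd_i = 4 (so c_i^2 = 4).
orthogonal = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]  # 2^2 factorial
correlated = [(-1.0, -1.4), (1.0, -0.2), (-1.0, 0.2), (1.0, 1.4)]  # d_1'd_2 = 2.4

norm_orthogonal = euclidean_norm(inverse_2x2(gram(orthogonal)))
norm_correlated = euclidean_norm(inverse_2x2(gram(correlated)))
lower_bound = math.sqrt(2 / 4.0 ** 2)  # [sum_i 1/c_i^4]^{1/2} with k = 2

print(norm_orthogonal, norm_correlated, lower_bound)
```

The orthogonal design attains the lower bound $\left[\sum_i 1/c_i^4\right]^{1/2}$ from the proof, while the correlated design gives a larger norm and, at a point such as $\mathbf{x} = (1, -1)'$, a larger scaled prediction variance.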

Examples of first-order orthogonal designs are given in Khuri and Cornell
(1996, Chapter 3). Prominent among these designs are the $2^k$ factorial design
(each input variable has two levels, and the number of all possible
combinations of these levels is $2^k$) and the Plackett-Burman design, which was
introduced in Plackett and Burman (1946). In the latter design, the number
of design points is equal to $k + 1$, which must be a multiple of 4.


8.4.2. Second-Order Designs 


These designs are used to fit second-order models of the form given by (8.12).
Since the number of parameters in this model is $p = (k + 1)(k + 2)/2$, the
number of experimental runs (or design points) in a second-order design
must at least be equal to $p$. The most frequently used second-order designs
include the $3^k$ design (each input variable has three levels, and the number of
all possible combinations of these levels is $3^k$), the central composite design
(CCD), and the Box-Behnken design.

The CCD was introduced by Box and Wilson (1951). It is made up of a
factorial portion consisting of a $2^k$ factorial design, an axial portion of $k$
pairs of points with the $i$th pair consisting of two symmetric points on the $i$th
coordinate axis ($i = 1, 2, \ldots, k$) at a distance of $\alpha$ ($> 0$) from the design
center (which coincides with the center of the coordinate system by the
coding scheme), and $n_0$ ($\ge 1$) center-point runs. The values of $\alpha$ and $n_0$ can
be chosen so that the CCD acquires certain desirable features (see, for
example, Khuri and Cornell, 1996, Section 4.5.3). In particular, if $\alpha = F^{1/4}$,
where $F$ denotes the number of points in the factorial portion, then the CCD
is rotatable. The choice of $n_0$ can affect the stability of the prediction
variance.
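
The construction of a CCD described above can be sketched in a few lines of Python. The function name `central_composite_design` and its arguments are hypothetical names chosen for this illustration; the axial distance is set to $\alpha = F^{1/4}$, the rotatable choice mentioned in the text.

```python
import math
from itertools import product

def central_composite_design(k, n_center=1):
    """Return (points, alpha) for a CCD in k coded variables with alpha = F**(1/4)."""
    # Factorial portion: all 2^k combinations of the levels -1 and +1.
    factorial = [tuple(float(v) for v in pt) for pt in product((-1, 1), repeat=k)]
    alpha = len(factorial) ** 0.25
    # Axial portion: k pairs of symmetric points on the coordinate axes.
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            pt = [0.0] * k
            pt[i] = sign * alpha
            axial.append(tuple(pt))
    # Center-point runs at the origin of the coded variables.
    center = [tuple([0.0] * k)] * n_center
    return factorial + axial + center, alpha

design, alpha = central_composite_design(k=2, n_center=1)
# Distinct distances of the design points from the design center.
radii = sorted({round(math.hypot(*pt), 10) for pt in design})
print(len(design), alpha, radii)
```

For $k = 2$ this gives $4 + 4 + 1 = 9$ runs, which exceeds the $p = 6$ parameters of the second-order model, and the factorial and axial points all lie at distance $\sqrt{2}$ from the center.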

The Box-Behnken design, introduced in Box and Behnken (1960), is a
subset of a $3^k$ factorial design and, in general, requires many fewer points. It
also compares favorably with the CCD. A thorough description of this design
is given in Box and Draper (1987, Section 15.4).

Other examples of second-order designs are given in Khuri and Cornell 
(1996, Chapter 4). 


8.4.3. Variance and Bias Design Criteria 


We have seen that the minimization of the prediction variance represents an 
important criterion for the selection of a response surface design. This 
criterion, however, presumes that the fitted model is correct. There are many 
situations in which bias in the predicted response can occur due to fitting the 
wrong model. We refer to this as model bias. 

Box and Draper (1959, 1963) presented convincing arguments in favor of 
recognizing bias as an important design criterion—in certain cases, even 
more important than the variance criterion. 

Consider again model (8.36). The response value at a point $\mathbf{x} =
(x_1, x_2, \ldots, x_k)'$ in a region $R$ is represented as

$$y(\mathbf{x}) = \mathbf{f}'(\mathbf{x})\boldsymbol{\beta} + \epsilon, \tag{8.47}$$

where $\mathbf{f}'(\mathbf{x})$ is the same as in model (8.38). While it is hoped that model (8.47)
is correct, there is always a fear that the true model is different. Let us
therefore suppose that in reality the true mean response at $\mathbf{x}$, denoted by
$\eta(\mathbf{x})$, is given by

$$\eta(\mathbf{x}) = \mathbf{f}'(\mathbf{x})\boldsymbol{\beta} + \mathbf{g}'(\mathbf{x})\boldsymbol{\delta}, \tag{8.48}$$

where the elements of $\mathbf{g}'(\mathbf{x})$ depend on $\mathbf{x}$ and consist of powers and cross
products of powers of $x_1, x_2, \ldots, x_k$ of degree $d' > d$, with $d$ being the order
of model (8.47), and $\boldsymbol{\delta}$ is a vector of $q$ unknown parameters. For a given
design $\mathbf{D}$ of $n$ experimental runs, we then have the model

$$\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\delta},$$

where $\boldsymbol{\eta}$ is the vector of true means (or expected values) of the elements of $\mathbf{y}$
at the $n$ design points, $\mathbf{X}$ is the same as in model (8.36), and $\mathbf{Z}$ is a matrix of
order $n \times q$ whose $u$th row is equal to $\mathbf{g}'(\mathbf{x}_u)$. Here, $\mathbf{x}_u'$ denotes the $u$th row
of $\mathbf{D}$ ($u = 1, 2, \ldots, n$).

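At each point $\mathbf{x}$ in $R$, the mean squared error (MSE) of $\hat{y}(\mathbf{x})$, where $\hat{y}(\mathbf{x})$ is
the predicted response as given by formula (8.38), is defined as

$$\mathrm{MSE}[\hat{y}(\mathbf{x})] = E[\hat{y}(\mathbf{x}) - \eta(\mathbf{x})]^2.$$

This can be expressed as

$$\mathrm{MSE}[\hat{y}(\mathbf{x})] = \mathrm{Var}[\hat{y}(\mathbf{x})] + \mathrm{Bias}^2[\hat{y}(\mathbf{x})], \tag{8.49}$$

where $\mathrm{Bias}[\hat{y}(\mathbf{x})] = E[\hat{y}(\mathbf{x})] - \eta(\mathbf{x})$. The fundamental philosophy of Box and
Draper (1959, 1963) is centered around the consideration of the integrated
mean squared error (IMSE) of $\hat{y}(\mathbf{x})$. This is denoted by $J$ and is defined in
terms of a $k$-tuple Riemann integral over the region $R$, namely,

$$J = \frac{n\Omega}{\sigma^2}\int_R \mathrm{MSE}[\hat{y}(\mathbf{x})]\,d\mathbf{x}, \tag{8.50}$$

where $\Omega^{-1} = \int_R d\mathbf{x}$ and $\sigma^2$ is the error variance. The partitioning of
$\mathrm{MSE}[\hat{y}(\mathbf{x})]$ as in formula (8.49) enables us to separate $J$ into two parts:

$$J = \frac{n\Omega}{\sigma^2}\int_R \mathrm{Var}[\hat{y}(\mathbf{x})]\,d\mathbf{x} + \frac{n\Omega}{\sigma^2}\int_R \mathrm{Bias}^2[\hat{y}(\mathbf{x})]\,d\mathbf{x} = V + B. \tag{8.51}$$

The quantities $V$ and $B$ are called the average variance and average squared
bias of $\hat{y}(\mathbf{x})$, respectively. Both $V$ and $B$ depend on the design $\mathbf{D}$. Thus a
reasonable choice of design is one that minimizes (1) $V$ alone, (2) $B$ alone, or
(3) $J = V + B$.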

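Now, using formula (8.40), $V$ can be written as

$$V = n\Omega\int_R \mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x})\,d\mathbf{x}
= \mathrm{tr}\left\{n(\mathbf{X}'\mathbf{X})^{-1}\left[\Omega\int_R \mathbf{f}(\mathbf{x})\mathbf{f}'(\mathbf{x})\,d\mathbf{x}\right]\right\}
= \mathrm{tr}\left[n(\mathbf{X}'\mathbf{X})^{-1}\boldsymbol{\Gamma}_{11}\right], \tag{8.52}$$

where

$$\boldsymbol{\Gamma}_{11} = \Omega\int_R \mathbf{f}(\mathbf{x})\mathbf{f}'(\mathbf{x})\,d\mathbf{x}.$$

As for $B$, we note from formula (8.37) that

$$E(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\eta} = \boldsymbol{\beta} + \mathbf{A}\boldsymbol{\delta},$$

where $\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}$. Thus from formula (8.38) we have

$$E[\hat{y}(\mathbf{x})] = \mathbf{f}'(\mathbf{x})(\boldsymbol{\beta} + \mathbf{A}\boldsymbol{\delta}).$$

Using the expression for $\eta(\mathbf{x})$ in formula (8.48), $B$ can be written as

$$\begin{aligned}
B &= \frac{n\Omega}{\sigma^2}\int_R \left[\mathbf{f}'(\mathbf{x})\boldsymbol{\beta} + \mathbf{f}'(\mathbf{x})\mathbf{A}\boldsymbol{\delta} - \mathbf{f}'(\mathbf{x})\boldsymbol{\beta} - \mathbf{g}'(\mathbf{x})\boldsymbol{\delta}\right]^2 d\mathbf{x} \\
&= \frac{n\Omega}{\sigma^2}\int_R \left[\mathbf{f}'(\mathbf{x})\mathbf{A}\boldsymbol{\delta} - \mathbf{g}'(\mathbf{x})\boldsymbol{\delta}\right]^2 d\mathbf{x} \\
&= \frac{n\Omega}{\sigma^2}\int_R \boldsymbol{\delta}'\left[\mathbf{A}'\mathbf{f}(\mathbf{x}) - \mathbf{g}(\mathbf{x})\right]\left[\mathbf{f}'(\mathbf{x})\mathbf{A} - \mathbf{g}'(\mathbf{x})\right]\boldsymbol{\delta}\,d\mathbf{x} \\
&= \frac{n\Omega}{\sigma^2}\int_R \boldsymbol{\delta}'\left[\mathbf{A}'\mathbf{f}(\mathbf{x})\mathbf{f}'(\mathbf{x})\mathbf{A} - \mathbf{g}(\mathbf{x})\mathbf{f}'(\mathbf{x})\mathbf{A} - \mathbf{A}'\mathbf{f}(\mathbf{x})\mathbf{g}'(\mathbf{x}) + \mathbf{g}(\mathbf{x})\mathbf{g}'(\mathbf{x})\right]\boldsymbol{\delta}\,d\mathbf{x} \\
&= \frac{n\,\boldsymbol{\delta}'\boldsymbol{\Delta}\boldsymbol{\delta}}{\sigma^2},
\end{aligned}$$

where

$$\boldsymbol{\Delta} = \mathbf{A}'\boldsymbol{\Gamma}_{11}\mathbf{A} - \boldsymbol{\Gamma}_{12}'\mathbf{A} - \mathbf{A}'\boldsymbol{\Gamma}_{12} + \boldsymbol{\Gamma}_{22}, \qquad
\boldsymbol{\Gamma}_{12} = \Omega\int_R \mathbf{f}(\mathbf{x})\mathbf{g}'(\mathbf{x})\,d\mathbf{x}, \qquad
\boldsymbol{\Gamma}_{22} = \Omega\int_R \mathbf{g}(\mathbf{x})\mathbf{g}'(\mathbf{x})\,d\mathbf{x}. \tag{8.53}$$

The matrices $\boldsymbol{\Gamma}_{11}$, $\boldsymbol{\Gamma}_{12}$, and $\boldsymbol{\Gamma}_{22}$ are called region moments. By adding and
subtracting the matrix $\boldsymbol{\Gamma}_{12}'\boldsymbol{\Gamma}_{11}^{-1}\boldsymbol{\Gamma}_{12}$ from $\boldsymbol{\Delta}$ in formula (8.53), $B$ can be
expressed as

$$B = \frac{n}{\sigma^2}\,\boldsymbol{\delta}'\left[\left(\boldsymbol{\Gamma}_{22} - \boldsymbol{\Gamma}_{12}'\boldsymbol{\Gamma}_{11}^{-1}\boldsymbol{\Gamma}_{12}\right) + \left(\mathbf{A} - \boldsymbol{\Gamma}_{11}^{-1}\boldsymbol{\Gamma}_{12}\right)'\boldsymbol{\Gamma}_{11}\left(\mathbf{A} - \boldsymbol{\Gamma}_{11}^{-1}\boldsymbol{\Gamma}_{12}\right)\right]\boldsymbol{\delta}. \tag{8.54}$$

We note that the design $\mathbf{D}$ affects only the second expression inside brackets
on the right-hand side of formula (8.54). Thus to minimize $B$, the design $\mathbf{D}$
should be chosen such that

$$\mathbf{A} - \boldsymbol{\Gamma}_{11}^{-1}\boldsymbol{\Gamma}_{12} = \mathbf{0}. \tag{8.55}$$

Since $\mathbf{A} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Z}$, a sufficient (but not necessary) condition for the
minimization of $B$ is

$$\mathbf{M}_{11} = \boldsymbol{\Gamma}_{11}, \qquad \mathbf{M}_{12} = \boldsymbol{\Gamma}_{12}, \tag{8.56}$$

where $\mathbf{M}_{11} = (1/n)\mathbf{X}'\mathbf{X}$ and $\mathbf{M}_{12} = (1/n)\mathbf{X}'\mathbf{Z}$ are the so-called design moments.
Thus a sufficient condition for the minimization of $B$ is the equality of the
design moments, $\mathbf{M}_{11}$ and $\mathbf{M}_{12}$, to the corresponding region moments, $\boldsymbol{\Gamma}_{11}$
and $\boldsymbol{\Gamma}_{12}$.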
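
To make the trade-off between $V$ and $B$ concrete, the following Python sketch works out a small case that is not taken from the text: a first-order model $\mathbf{f}(x) = (1, x)'$ is fitted over $R = [-1, 1]$ (so $\Omega = 1/2$) while the true model contains the extra term $\mathbf{g}(x) = (x^2)$. The region moments are then $\boldsymbol{\Gamma}_{11} = \mathrm{diag}(1, 1/3)$, $\boldsymbol{\Gamma}_{12} = (1/3, 0)'$, and $\Gamma_{22} = 1/5$. The two candidate designs and all function names are illustrative assumptions.

```python
from fractions import Fraction as Fr

# Region moments for R = [-1, 1], f(x) = (1, x)', g(x) = (x^2), Omega = 1/2.
G11 = [[Fr(1), Fr(0)], [Fr(0), Fr(1, 3)]]
G12 = [Fr(1, 3), Fr(0)]
G22 = Fr(1, 5)

def design_moments(points):
    """Return M11 = X'X/n and M12 = X'Z/n for a one-variable design."""
    n = len(points)
    m1 = sum(Fr(x) for x in points) / n
    m2 = sum(Fr(x) ** 2 for x in points) / n
    m3 = sum(Fr(x) ** 3 for x in points) / n
    return [[Fr(1), m1], [m1, m2]], [m2, m3]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def average_variance_and_delta(points):
    """Return V from (8.52) and the scalar Delta from (8.53)."""
    M11, M12 = design_moments(points)
    M11_inv = inverse_2x2(M11)
    # V = tr[n (X'X)^{-1} Gamma_11] = tr[M11^{-1} Gamma_11]
    V = sum(M11_inv[i][j] * G11[j][i] for i in range(2) for j in range(2))
    # A = (X'X)^{-1} X'Z = M11^{-1} M12
    A = [M11_inv[i][0] * M12[0] + M11_inv[i][1] * M12[1] for i in range(2)]
    quad = sum(A[i] * G11[i][j] * A[j] for i in range(2) for j in range(2))
    cross = sum(G12[i] * A[i] for i in range(2))
    return V, quad - 2 * cross + G22   # B = n * delta^2 * Delta / sigma^2

V_end, delta_end = average_variance_and_delta([-1, 1])               # points at the ends
V_mom, delta_mom = average_variance_and_delta([-1, 0, 0, 0, 0, 1])   # matches (8.56)
print(V_end, delta_end, V_mom, delta_mom)
```

The six-run design satisfies condition (8.56), since its second moment is $1/3$ and its odd moments vanish; it attains the smallest possible $\Delta = \Gamma_{22} - \boldsymbol{\Gamma}_{12}'\boldsymbol{\Gamma}_{11}^{-1}\boldsymbol{\Gamma}_{12} = 4/45$ at the cost of a larger $V$ ($2$ versus $4/3$) than the design with all runs at $\pm 1$.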

The minimization of $J = V + B$ is not possible without the specification of
$\boldsymbol{\delta}/\sigma$. Box and Draper (1959, 1963) showed that unless $V$ is considerably
larger than $B$, the optimal design that minimizes $J$ has characteristics similar
to those of a design that minimizes just $B$.

Examples of designs that minimize V alone or B alone can be found in 
Box and Draper (1987, Chapter 13), Khuri and Cornell (1996, Chapter 6), 
and Myers (1976, Chapter 9). 


8.5. ALPHABETIC OPTIMALITY OF DESIGNS 


Let us again consider model (8.47), which we now assume to be correct, that
is, the true mean response, $\eta(\mathbf{x})$, is equal to $\mathbf{f}'(\mathbf{x})\boldsymbol{\beta}$. In this case, the matrix
$\mathbf{X}'\mathbf{X}$ plays an important role in the determination of an optimal design, since
the elements of $(\mathbf{X}'\mathbf{X})^{-1}$ are proportional to the variances and covariances of
the least-squares estimators of the model's parameters [see formula (8.39)].
The mathematical theory of optimal designs, which was developed by
Kiefer (1958, 1959, 1960, 1961, 1962a, b), is concerned with the choice of
designs that minimize certain functions of the elements of $(\mathbf{X}'\mathbf{X})^{-1}$. The kernel
of Kiefer's approach is based on the concept of design measure, which
represents a generalization of the traditional design concept. So far, each of
the designs that we have considered for fitting a response surface model has
consisted of a set of $n$ points in a $k$-dimensional space ($k \ge 1$). Suppose that
$\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$ are distinct points of an $n$-point design ($m \le n$) with the $l$th
($l = 1, 2, \ldots, m$) point being replicated $n_l$ ($\ge 1$) times (that is, $n_l$ repeated
observations are taken at this point). The design can therefore be regarded as
a collection of points in a region of interest $R$ with the $l$th point being
assigned the weight $n_l/n$ ($l = 1, 2, \ldots, m$), where $n = \sum_{l=1}^{m} n_l$. Kiefer
generalized this setup using the so-called continuous design measure, which is
basically a probability measure $\xi(\mathbf{x})$ defined on $R$ that satisfies the conditions

$$\xi(\mathbf{x}) \ge 0 \ \text{ for all } \mathbf{x} \in R \qquad \text{and} \qquad \int_R d\xi(\mathbf{x}) = 1.$$

In particular, the measure induced by a traditional design $\mathbf{D}$ with $n$ points is
called a discrete design measure and is denoted by $\xi_n$. It should be noted that
while a discrete design measure is realizable in practice, the same is not true
of a general continuous design measure. For this reason, the former design is
called exact and the latter design is called approximate.

By definition, the moment matrix of a design measure $\xi$ is a symmetric
matrix of the form $\mathbf{M}(\xi) = [m_{ij}(\xi)]$, where

$$m_{ij}(\xi) = \int_R f_i(\mathbf{x})f_j(\mathbf{x})\,d\xi(\mathbf{x}). \tag{8.57}$$

Here, $f_i(\mathbf{x})$ is the $i$th element of $\mathbf{f}(\mathbf{x})$ in formula (8.47), $i = 1, 2, \ldots, p$. For a
discrete design measure $\xi_n$, the $(i, j)$th element of the moment matrix is

$$m_{ij}(\xi_n) = \frac{1}{n}\sum_{l=1}^{m} n_l f_i(\mathbf{x}_l)f_j(\mathbf{x}_l), \tag{8.58}$$

where $m$ is the number of distinct design points and $n_l$ is the number of
replications at the $l$th point ($l = 1, 2, \ldots, m$). In this special case, the matrix
$\mathbf{M}(\xi)$ reduces to the usual moment matrix $(1/n)\mathbf{X}'\mathbf{X}$, where $\mathbf{X}$ is the same
matrix as in formula (8.36).

For a general design measure $\xi$, the standardized prediction variance,
denoted by $d(\mathbf{x}, \xi)$, is defined as

$$d(\mathbf{x}, \xi) = \mathbf{f}'(\mathbf{x})[\mathbf{M}(\xi)]^{-1}\mathbf{f}(\mathbf{x}), \tag{8.59}$$

where $\mathbf{M}(\xi)$ is assumed to be nonsingular. In particular, for a discrete design
measure $\xi_n$, the prediction variance in formula (8.40) is equal to
$(\sigma^2/n)\,d(\mathbf{x}, \xi_n)$.

Let $H$ denote the class of all design measures defined on the region $R$.
A prominent design criterion that has received a great deal of attention is
that of D-optimality, in which the determinant of $\mathbf{M}(\xi)$ is maximized. Thus a
design measure $\xi_d$ is D-optimal if

$$\det[\mathbf{M}(\xi_d)] = \sup_{\xi \in H}\det[\mathbf{M}(\xi)]. \tag{8.60}$$

The rationale behind this criterion has to do with the minimization of the
generalized variance of the least-squares estimator $\hat{\boldsymbol{\beta}}$ of the parameter vector
$\boldsymbol{\beta}$. By definition, the generalized variance of $\hat{\boldsymbol{\beta}}$ is the same as the
determinant of the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$. This is based on the fact that
under the normality assumption, the content (volume) of a fixed-level
confidence region on $\boldsymbol{\beta}$ is proportional to $[\det(\mathbf{X}'\mathbf{X})]^{-1/2}$. The review articles by
St. John and Draper (1975), Ash and Hedayat (1978), and Atkinson
(1982, 1988) contain many references on D-optimality.

Another design criterion that is closely related to D-optimality is
G-optimality, which is concerned with the prediction variance. By definition, a
design measure $\xi_g$ is G-optimal if it minimizes over $H$ the maximum
standardized prediction variance over the region $R$, that is,

$$\sup_{\mathbf{x} \in R} d(\mathbf{x}, \xi_g) = \inf_{\xi \in H}\left\{\sup_{\mathbf{x} \in R} d(\mathbf{x}, \xi)\right\}. \tag{8.61}$$

Kiefer and Wolfowitz (1960) showed that D-optimality and G-optimality, as
defined by formulas (8.60) and (8.61), are equivalent. Furthermore, a design
measure $\xi^*$ is G-optimal (or D-optimal) if and only if

$$\sup_{\mathbf{x} \in R} d(\mathbf{x}, \xi^*) = p, \tag{8.62}$$

where $p$ is the number of parameters in the model. Formula (8.62) can be
conveniently used to determine if a given design measure is D-optimal, since
in general $\sup_{\mathbf{x} \in R} d(\mathbf{x}, \xi) \ge p$ for any design measure $\xi \in H$. If equality can
be achieved by a design measure, then it must be G-optimal, and hence
D-optimal.


EXAMPLE 8.5.1. Consider fitting a second-order model in one input
variable $x$ over the region $R = [-1, 1]$. In this case, model (8.47) takes the
form $y(x) = \beta_0 + \beta_1 x + \beta_{11}x^2 + \epsilon$, that is, $\mathbf{f}'(x) = (1, x, x^2)$. Suppose that the
design measure used is defined as

$$\xi(x) = \begin{cases} \tfrac{1}{3}, & x = -1, 0, 1, \\ 0 & \text{otherwise.} \end{cases} \tag{8.63}$$

Thus $\xi$ is a discrete design measure that assigns one-third of the
experimental runs to each of the points $-1$, $0$, and $1$. This design measure is
D-optimal. To verify this claim, we first need to determine the values of the
elements of the moment matrix $\mathbf{M}(\xi)$. Using formula (8.58) with $n_l/n = \tfrac{1}{3}$
for $l = 1, 2, 3$, we find that $m_{11} = 1$, $m_{12} = 0$, $m_{13} = \tfrac{2}{3}$, $m_{22} = \tfrac{2}{3}$, $m_{23} = 0$, and
$m_{33} = \tfrac{2}{3}$. Hence,

$$\mathbf{M}(\xi) = \begin{bmatrix} 1 & 0 & \tfrac{2}{3} \\ 0 & \tfrac{2}{3} & 0 \\ \tfrac{2}{3} & 0 & \tfrac{2}{3} \end{bmatrix}, \qquad
\mathbf{M}^{-1}(\xi) = \begin{bmatrix} 3 & 0 & -3 \\ 0 & \tfrac{3}{2} & 0 \\ -3 & 0 & \tfrac{9}{2} \end{bmatrix}.$$

By applying formula (8.59) we find that $d(x, \xi) = 3 - \tfrac{9}{2}x^2 + \tfrac{9}{2}x^4$, $-1 \le x \le 1$.
We note that $d(x, \xi) \le 3$ for all $x$ in $[-1, 1]$ with $d(0, \xi) = 3$. Thus
$\sup_{x \in R} d(x, \xi) = 3$. Since 3 is the number of parameters in the model, then
by condition (8.62) we conclude that the design measure defined by formula
(8.63) is D-optimal.
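
The arithmetic of Example 8.5.1 can be checked with exact rational numbers. The following Python sketch builds $\mathbf{M}(\xi)$ from formula (8.58), inverts it, and evaluates $d(x, \xi)$ from formula (8.59) on a grid over $[-1, 1]$. The function names and the grid spacing of $0.01$ are assumptions made for this illustration.

```python
from fractions import Fraction as Fr

def f(x):
    """Regressor vector f(x) = (1, x, x^2)' of the quadratic model."""
    x = Fr(x)
    return [Fr(1), x, x * x]

def moment_matrix(points, weights):
    """M(xi) from (8.58), with weights n_l / n attached to the points."""
    return [[sum(w * f(x)[i] * f(x)[j] for x, w in zip(points, weights))
             for j in range(3)] for i in range(3)]

def inverse(m):
    """Gauss-Jordan inversion with exact fractions."""
    n = len(m)
    aug = [list(row) + [Fr(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def d(x, m_inv):
    """Standardized prediction variance (8.59)."""
    fx = f(x)
    return sum(fx[i] * m_inv[i][j] * fx[j] for i in range(3) for j in range(3))

M = moment_matrix([-1, 0, 1], [Fr(1, 3)] * 3)
M_inv = inverse(M)
grid = [Fr(i, 100) for i in range(-100, 101)]
sup_d = max(d(x, M_inv) for x in grid)
print(M_inv, sup_d)
```

On this grid the largest value of $d(x, \xi)$ is exactly $3 = p$, attained at $x = -1, 0, 1$, in agreement with condition (8.62).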


In addition to the D- and G-optimality criteria, other variance-related
design criteria have also been investigated. These include A- and
E-optimality. By definition, a design measure is A-optimal if it minimizes the trace of
$\mathbf{M}^{-1}(\xi)$. This is equivalent to minimizing the sum of the variances of the
least-squares estimators of the fitted model's parameters. In E-optimality,
the smallest eigenvalue of $\mathbf{M}(\xi)$ is maximized. The rationale behind this
criterion is based on the fact that

$$\sup_{\mathbf{x} \in R} d(\mathbf{x}, \xi) \le \frac{1}{\lambda_{\min}}\sup_{\mathbf{x} \in R}\mathbf{f}'(\mathbf{x})\mathbf{f}(\mathbf{x}),$$

as can be seen from formula (8.59), where $\lambda_{\min}$ is the smallest eigenvalue of
$\mathbf{M}(\xi)$. Hence, $d(\mathbf{x}, \xi)$ can be reduced by maximizing $\lambda_{\min}$.

The efficiency of a design measure $\zeta \in H$ with respect to a D-optimal
design is defined as

$$\text{D-efficiency} = \left\{\frac{\det[\mathbf{M}(\zeta)]}{\sup_{\xi \in H}\det[\mathbf{M}(\xi)]}\right\}^{1/p},$$

where $p$ is the number of parameters in the model. Similarly, the
G-efficiency of $\zeta$ is defined as

$$\text{G-efficiency} = \frac{p}{\sup_{\mathbf{x} \in R} d(\mathbf{x}, \zeta)}.$$

Both D- and G-efficiency values fall within the interval $[0, 1]$. The closer
these values are to 1, the more efficient their corresponding designs are.
Lucas (1976) compared several second-order designs (such as central
composite and Box-Behnken designs) on the basis of their D- and G-efficiency
values.

The equivalence theorem of Kiefer and Wolfowitz (1960) can be applied
to construct a D-optimal design using a sequential procedure. This
procedure is described in Wynn (1970, 1972) and goes as follows: Let $\mathbf{D}_{n_0}$ denote an
initial response surface design with $n_0$ points, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{n_0}$, for which the
matrix $\mathbf{X}'\mathbf{X}$ is nonsingular. A point $\mathbf{x}_{n_0+1}$ is found in the region $R$ such that

$$d(\mathbf{x}_{n_0+1}, \xi_{n_0}) = \sup_{\mathbf{x} \in R} d(\mathbf{x}, \xi_{n_0}),$$

where $\xi_{n_0}$ is the discrete design measure that represents $\mathbf{D}_{n_0}$. By augmenting
$\mathbf{D}_{n_0}$ with $\mathbf{x}_{n_0+1}$ we obtain the design $\mathbf{D}_{n_0+1}$. Then, another point $\mathbf{x}_{n_0+2}$ is
chosen such that

$$d(\mathbf{x}_{n_0+2}, \xi_{n_0+1}) = \sup_{\mathbf{x} \in R} d(\mathbf{x}, \xi_{n_0+1}),$$

where $\xi_{n_0+1}$ is the discrete design measure that represents $\mathbf{D}_{n_0+1}$. The point
$\mathbf{x}_{n_0+2}$ is then added to $\mathbf{D}_{n_0+1}$ to obtain the design $\mathbf{D}_{n_0+2}$. By continuing this
process we obtain a sequence of discrete design measures, namely,
$\xi_{n_0}, \xi_{n_0+1}, \xi_{n_0+2}, \ldots$. Wynn (1970) showed that this sequence converges to the
D-optimal design $\xi_d$, that is,

$$\det[\mathbf{M}(\xi_{n_0+n})] \to \det[\mathbf{M}(\xi_d)]$$

as $n \to \infty$. An example is given in Wynn (1970, Section 5) to illustrate this
sequential procedure.
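
The sequential procedure can be sketched in Python for the quadratic model of Example 8.5.1 on $R = [-1, 1]$. This is a minimal illustration, not Wynn's own example: the supremum over $R$ is replaced by a maximum over a finite grid of candidate points, and the four-point starting design, the grid, and the number of steps are all assumptions.

```python
def f(x):
    """Regressor vector f(x) = (1, x, x^2)' of the quadratic model."""
    return (1.0, x, x * x)

def moment_matrix(points):
    """M = (1/n) X'X for an exact design with the given points."""
    rows = [f(x) for x in points]
    n = len(rows)
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(3)]
            for i in range(3)]

def det3(m):
    (a, b, c), (d, e, f_), (g, h, i) = m
    return a * (e * i - f_ * h) - b * (d * i - f_ * g) + c * (d * h - e * g)

def inv3(m):
    """Inverse of a 3 x 3 matrix via its adjugate."""
    (a, b, c), (d, e, f_), (g, h, i) = m
    det = det3(m)
    adj = [[e * i - f_ * h, c * h - b * i, b * f_ - c * e],
           [f_ * g - d * i, a * i - c * g, c * d - a * f_],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]

def pred_var(x, m_inv):
    """Standardized prediction variance d(x, xi) from (8.59)."""
    fx = f(x)
    return sum(fx[i] * m_inv[i][j] * fx[j] for i in range(3) for j in range(3))

def wynn(initial_points, grid, steps):
    """Repeatedly add the candidate point with the largest d(x, xi_n)."""
    points = list(initial_points)
    for _ in range(steps):
        m_inv = inv3(moment_matrix(points))
        points.append(max(grid, key=lambda x: pred_var(x, m_inv)))
    return points

grid = [i / 20 for i in range(-20, 21)]          # candidate points in [-1, 1]
design = wynn([-1.0, -0.5, 0.5, 1.0], grid, steps=300)
det_final = det3(moment_matrix(design))
det_optimal = 4 / 27                             # det M(xi) for the measure (8.63)
print(det_final, det_optimal)
```

The determinant of the moment matrix approaches the value $4/27$ attained by the D-optimal measure of Example 8.5.1 as points are added.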

The four design criteria, A-, D-, E-, and G-optimality, are referred to as
alphabetic optimality. More detailed information about these criteria can be
found in Atkinson (1982, 1988), Fedorov (1972), Pazman (1986), and Silvey
(1980). Recall that to perform an actual experiment, one must use a discrete
design. It is possible to find a discrete design measure $\xi_n$ that approximates
an optimal design measure. The approximation is good whenever $n$ is large
with respect to $p$ (the number of parameters in the model).

Note that the equivalence theorem of Kiefer and Wolfowitz (1960) applies 
to general design measures and not necessarily to discrete design measures, 
that is, D- and G-optimality criteria are not equivalent for the class of 
discrete design measures. Optimal n-point discrete designs, however, can still 
be found on the basis of maximizing the determinant of X'X, for example. In 
this case, finding an optimal n-point design requires a search involving nk 
variables, where k is the number of input variables. Several algorithms have 
been introduced for this purpose. For example, the DETMAX algorithm by 
Mitchell (1974) is used to maximize det(X’X). A review of algorithms for 
constructing optimal discrete designs can be found in Cook and Nachtsheim 
(1980) (see also Johnson and Nachtsheim, 1983). 

One important criticism of the alphabetic optimality approach is that it is
set within a rigid framework governed by a set of assumptions. For example,
a specific model for the response function must be assumed as the “true”
model. Optimal design measures can be quite sensitive to this assumption.
Box (1982) presented a critique of this approach. He argued that in a
response surface situation, it may not be realistic to assume that a model
such as (8.47) represents the true response function exactly. Some protection
against bias in the model should therefore be considered when choosing a
response surface design. On the other hand, Kiefer (1975) criticized certain
aspects of the preoccupation with bias, pointing out examples in which the
variance criterion is compromised for the sake of the bias criterion. It follows
that design selection should be guided by more than a single criterion (see
Kiefer, 1975, page 286; Box, 1982, Section 7). A reasonable approach is to
select compromise designs that are sufficiently good (but not necessarily
optimal) from the viewpoint of several criteria that are important to the user.


8.6. DESIGNS FOR NONLINEAR MODELS 


The models we have considered so far in the area of response surface
methodology were linear in the parameters; hence the term linear models.
There are, however, many experimental situations in which linear models do
not adequately represent the true mean response. For example, the growth
of an organism is more appropriately depicted by a nonlinear model. By
definition, a nonlinear model is one of the form

$$y(\mathbf{x}) = h(\mathbf{x}, \boldsymbol{\theta}) + \epsilon, \tag{8.64}$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_k)'$ is a vector of $k$ input variables, $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_p)'$
is a vector of $p$ unknown parameters, $\epsilon$ is a random error, and $h(\mathbf{x}, \boldsymbol{\theta})$ is a
known function, nonlinear in at least one element of $\boldsymbol{\theta}$. An example of a
nonlinear model is

$$h(x, \boldsymbol{\theta}) = \frac{\theta_1 x}{\theta_2 + x}.$$

Here, $\boldsymbol{\theta} = (\theta_1, \theta_2)'$ and $\theta_2$ is a nonlinear parameter. This particular model is
known as the Michaelis-Menten model for enzyme kinetics. It relates the
initial velocity of an enzymatic reaction to the substrate concentration $x$.

In contrast to linear models, nonlinear models have not received a great
deal of attention in response surface methodology, especially in the design
area. The main design criterion for nonlinear models is the D-optimality
criterion, which actually applies to a linearized form of the nonlinear model.
More specifically, this criterion depends on the assumption that in some
neighborhood of a specified value $\boldsymbol{\theta}_0$ of $\boldsymbol{\theta}$, the function $h(\mathbf{x}, \boldsymbol{\theta})$ is
approximately linear in $\boldsymbol{\theta}$. In this case, a first-order Taylor's expansion of $h(\mathbf{x}, \boldsymbol{\theta})$
yields the following approximation of $h(\mathbf{x}, \boldsymbol{\theta})$:

$$h(\mathbf{x}, \boldsymbol{\theta}) \approx h(\mathbf{x}, \boldsymbol{\theta}_0) + \sum_{i=1}^{p}(\theta_i - \theta_{i0})\frac{\partial h(\mathbf{x}, \boldsymbol{\theta}_0)}{\partial\theta_i}.$$

Thus if $\boldsymbol{\theta}$ is close enough to $\boldsymbol{\theta}_0$, then we have approximately the linear model

$$z(\mathbf{x}) = \sum_{i=1}^{p}\psi_i\frac{\partial h(\mathbf{x}, \boldsymbol{\theta}_0)}{\partial\theta_i} + \epsilon, \tag{8.65}$$

where $z(\mathbf{x}) = y(\mathbf{x}) - h(\mathbf{x}, \boldsymbol{\theta}_0)$, and $\psi_i$ is the $i$th element of $\boldsymbol{\psi} = \boldsymbol{\theta} - \boldsymbol{\theta}_0$
($i = 1, 2, \ldots, p$).

For a given design consisting of $n$ experimental runs, model (8.65) can be
written in vector form as

$$\mathbf{z} = \mathbf{H}(\boldsymbol{\theta}_0)\boldsymbol{\psi} + \boldsymbol{\epsilon}, \tag{8.66}$$

where $\mathbf{H}(\boldsymbol{\theta}_0)$ is an $n \times p$ matrix whose $(u, i)$th element is $\partial h(\mathbf{x}_u, \boldsymbol{\theta}_0)/\partial\theta_i$ with
$\mathbf{x}_u$ being the vector of design settings for the $k$ input variables at the $u$th
experimental run ($i = 1, 2, \ldots, p$; $u = 1, 2, \ldots, n$). Using the linearized form
given by model (8.66), a design is chosen to maximize the determinant
$\det[\mathbf{H}'(\boldsymbol{\theta}_0)\mathbf{H}(\boldsymbol{\theta}_0)]$. This is known as the Box-Lucas criterion (see Box and
Lucas, 1959).

It can be easily seen that a nonlinear design obtained on the basis of the
Box-Lucas criterion depends on the value of $\boldsymbol{\theta}_0$. This is an undesirable
characteristic of nonlinear models, since a design is supposed to be used for
estimating the unknown parameter vector $\boldsymbol{\theta}$. By contrast, designs for linear
models are not dependent on the fitted model's parameters. Several
procedures have been proposed for dealing with the problem of design
dependence on the parameters of a nonlinear model. These procedures are
mentioned in the review article by Myers, Khuri, and Carter (1989). See also
Khuri and Cornell (1996, Section 10.5).


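EXAMPLE 8.6.1. Let us again consider the Michaelis-Menten model
mentioned earlier. The partial derivatives of $h(x, \boldsymbol{\theta})$ with respect to $\theta_1$ and
$\theta_2$ are

$$\frac{\partial h(x, \boldsymbol{\theta})}{\partial\theta_1} = \frac{x}{\theta_2 + x}, \qquad
\frac{\partial h(x, \boldsymbol{\theta})}{\partial\theta_2} = \frac{-\theta_1 x}{(\theta_2 + x)^2}.$$

Suppose that it is desired to find a two-point design that consists of the
settings $x_1$ and $x_2$ using the Box-Lucas criterion. In this case,

$$\mathbf{H}(\boldsymbol{\theta}_0) = \begin{bmatrix} \dfrac{\partial h(x_1, \boldsymbol{\theta}_0)}{\partial\theta_1} & \dfrac{\partial h(x_1, \boldsymbol{\theta}_0)}{\partial\theta_2} \\[2ex] \dfrac{\partial h(x_2, \boldsymbol{\theta}_0)}{\partial\theta_1} & \dfrac{\partial h(x_2, \boldsymbol{\theta}_0)}{\partial\theta_2} \end{bmatrix}
= \begin{bmatrix} \dfrac{x_1}{\theta_{20} + x_1} & \dfrac{-\theta_{10}x_1}{(\theta_{20} + x_1)^2} \\[2ex] \dfrac{x_2}{\theta_{20} + x_2} & \dfrac{-\theta_{10}x_2}{(\theta_{20} + x_2)^2} \end{bmatrix},$$

where $\theta_{10}$ and $\theta_{20}$ are the elements of $\boldsymbol{\theta}_0$. In this example, $\mathbf{H}(\boldsymbol{\theta}_0)$ is a square
matrix. Hence,

$$\det[\mathbf{H}'(\boldsymbol{\theta}_0)\mathbf{H}(\boldsymbol{\theta}_0)] = \{\det[\mathbf{H}(\boldsymbol{\theta}_0)]\}^2
= \frac{\theta_{10}^2 x_1^2 x_2^2 (x_2 - x_1)^2}{(\theta_{20} + x_1)^4(\theta_{20} + x_2)^4}. \tag{8.67}$$

To determine the maximum of this determinant, let us first equate its partial
derivatives with respect to $x_1$ and $x_2$ to zero. It can be verified that the
solution of the resulting equations (that is, the stationary point) falls outside
the region of feasible values for $x_1$ and $x_2$ (both $x_1$ and $x_2$ must be
nonnegative). Let us therefore restrict our search for the maximum within
the region $R = \{(x_1, x_2) \mid 0 \le x_1 \le x_{\max},\ 0 \le x_2 \le x_{\max}\}$, where $x_{\max}$ is the
maximum allowable substrate concentration. Since the partial derivatives of the
determinant in formula (8.67) do not vanish in $R$, then its maximum must be
attained on the boundary of $R$. On $x_1 = 0$, or $x_2 = 0$, the value of the
determinant is zero. If $x_1 = x_{\max}$, then

$$\det[\mathbf{H}'(\boldsymbol{\theta}_0)\mathbf{H}(\boldsymbol{\theta}_0)] = \frac{\theta_{10}^2 x_{\max}^2 x_2^2 (x_2 - x_{\max})^2}{(\theta_{20} + x_{\max})^4(\theta_{20} + x_2)^4}.$$

It can be verified that this function of $x_2$ has a maximum at the point
$x_2 = \theta_{20}x_{\max}/(2\theta_{20} + x_{\max})$ with a value given by

$$\max_{x_1 = x_{\max}}\left\{\det[\mathbf{H}'(\boldsymbol{\theta}_0)\mathbf{H}(\boldsymbol{\theta}_0)]\right\} = \frac{\theta_{10}^2 x_{\max}^6}{16\,\theta_{20}^2(\theta_{20} + x_{\max})^6}. \tag{8.68}$$

Similarly, if $x_2 = x_{\max}$, then

$$\det[\mathbf{H}'(\boldsymbol{\theta}_0)\mathbf{H}(\boldsymbol{\theta}_0)] = \frac{\theta_{10}^2 x_{\max}^2 x_1^2 (x_{\max} - x_1)^2}{(\theta_{20} + x_1)^4(\theta_{20} + x_{\max})^4},$$

which attains the same maximum value as in formula (8.68) at the point
$x_1 = \theta_{20}x_{\max}/(2\theta_{20} + x_{\max})$. We conclude that the maximum of
$\det[\mathbf{H}'(\boldsymbol{\theta}_0)\mathbf{H}(\boldsymbol{\theta}_0)]$ over the region $R$ is achieved when $x_1 = x_{\max}$ and $x_2 =
\theta_{20}x_{\max}/(2\theta_{20} + x_{\max})$, or when $x_1 = \theta_{20}x_{\max}/(2\theta_{20} + x_{\max})$ and $x_2 = x_{\max}$.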
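
The location and value of the maximum in Example 8.6.1 can be checked numerically. The following Python sketch evaluates the determinant in formula (8.67) over a grid on $R$ and compares the best grid point with the closed-form results; the values $\theta_{10} = 1$, $\theta_{20} = 1$, $x_{\max} = 2$ and the grid spacing are arbitrary assumptions chosen for the illustration.

```python
import math

def box_lucas_det(x1, x2, theta10, theta20):
    """det[H'(theta_0) H(theta_0)] for the two-point design, formula (8.67)."""
    num = theta10 ** 2 * x1 ** 2 * x2 ** 2 * (x2 - x1) ** 2
    den = (theta20 + x1) ** 4 * (theta20 + x2) ** 4
    return num / den

theta10, theta20, x_max = 1.0, 1.0, 2.0          # assumed values
grid = [i / 100 for i in range(201)]             # 0.00, 0.01, ..., 2.00

# Exhaustive search over the square region R.
best = max(((x1, x2) for x1 in grid for x2 in grid),
           key=lambda p: box_lucas_det(p[0], p[1], theta10, theta20))

# Closed-form optimum: one setting at x_max, the other at x_star.
x_star = theta20 * x_max / (2 * theta20 + x_max)
max_formula = theta10 ** 2 * x_max ** 6 / (16 * theta20 ** 2 * (theta20 + x_max) ** 6)

print(best, x_star, box_lucas_det(best[0], best[1], theta10, theta20), max_formula)
```

With these values, $x^* = 0.5$, and the grid search selects the settings $0.5$ and $2$ with a determinant that agrees with formula (8.68).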


We can clearly see in this example the dependence of the design settings
on $\theta_2$, but not on $\theta_1$. This is attributed to the fact that $\theta_1$ appears linearly in
the model, but $\theta_2$ does not. In this case, the model is said to be partially
nonlinear. Its D-optimal design depends only on those parameters that do
not appear linearly. More details concerning partially nonlinear models can
be found in Khuri and Cornell (1996, Section 10.5.3).


8.7. MULTIRESPONSE OPTIMIZATION


By definition, a multiresponse experiment is one in which a number of
responses can be measured for each setting of a group of input variables. For
example, in a skim milk extrusion process, the responses, $y_1$ = percent
residual lactose and $y_2$ = percent ash, are known to depend on the input
variables, $x_1$ = pH level, $x_2$ = temperature, $x_3$ = concentration, and $x_4$ = time
(see Fichtali, Van De Voort, and Khuri, 1990).

As in single-response experiments, one of the objectives of a
multiresponse experiment is the determination of conditions on the input variables
that optimize the predicted responses. The definition of an optimum in a
multiresponse situation, however, is more complex than in the
single-response case. The reason for this is that when two or more response
variables are considered simultaneously, the meaning of an optimum
becomes unclear, since there is no unique way to order the values of a
multiresponse function. To overcome this difficulty, Khuri and Conlon (1981)
introduced a multiresponse optimization technique called the generalized
distance approach. The following is an outline of this approach:

Let $r$ be the number of responses, and $n$ be the number of experimental
runs for all the responses. Suppose that these responses can be represented
by the linear models

$$\mathbf{y}_i = \mathbf{X}\boldsymbol{\beta}_i + \boldsymbol{\epsilon}_i, \qquad i = 1, 2, \ldots, r,$$

where $\mathbf{y}_i$ is a vector of observations on the $i$th response, $\mathbf{X}$ is a known matrix
of order $n \times p$ and rank $p$, $\boldsymbol{\beta}_i$ is a vector of $p$ unknown parameters, and $\boldsymbol{\epsilon}_i$ is
a random error vector associated with the $i$th response ($i = 1, 2, \ldots, r$). It is
assumed that the rows of the error matrix $[\boldsymbol{\epsilon}_1 : \boldsymbol{\epsilon}_2 : \cdots : \boldsymbol{\epsilon}_r]$ are statistically
independent with each having a zero mean vector and a common
variance-covariance matrix $\boldsymbol{\Sigma}$. Note that the matrix $\mathbf{X}$ is assumed to be the
same for all the responses.

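Let $x_1, x_2, \ldots, x_k$ be input variables that influence the $r$ responses. The
predicted response value at a point $\mathbf{x} = (x_1, x_2, \ldots, x_k)'$ in a region $R$ for the
$i$th response is given by $\hat{y}_i(\mathbf{x}) = \mathbf{f}'(\mathbf{x})\hat{\boldsymbol{\beta}}_i$, where $\hat{\boldsymbol{\beta}}_i = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}_i$ is the
least-squares estimator of $\boldsymbol{\beta}_i$ ($i = 1, 2, \ldots, r$). Here, $\mathbf{f}'(\mathbf{x})$ is of the same form as a
row of $\mathbf{X}$, except that it is evaluated at the point $\mathbf{x}$. It follows that

$$\mathrm{Var}[\hat{y}_i(\mathbf{x})] = \sigma_{ii}\,\mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x}), \qquad i = 1, 2, \ldots, r,$$

$$\mathrm{Cov}[\hat{y}_i(\mathbf{x}), \hat{y}_j(\mathbf{x})] = \sigma_{ij}\,\mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x}), \qquad i \ne j = 1, 2, \ldots, r,$$

where $\sigma_{ij}$ is the $(i, j)$th element of $\boldsymbol{\Sigma}$. The variance-covariance matrix of
$\hat{\mathbf{y}}(\mathbf{x}) = [\hat{y}_1(\mathbf{x}), \hat{y}_2(\mathbf{x}), \ldots, \hat{y}_r(\mathbf{x})]'$ is then of the form

$$\mathrm{Var}[\hat{\mathbf{y}}(\mathbf{x})] = \mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x})\,\boldsymbol{\Sigma}.$$

Since $\boldsymbol{\Sigma}$ is in general unknown, an unbiased estimator, $\hat{\boldsymbol{\Sigma}}$, of $\boldsymbol{\Sigma}$ can be used
instead, where

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n - p}\,\mathbf{Y}'\left[\mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]\mathbf{Y},$$

and $\mathbf{Y} = [\mathbf{y}_1 : \mathbf{y}_2 : \cdots : \mathbf{y}_r]$. The matrix $\hat{\boldsymbol{\Sigma}}$ is nonsingular provided that $\mathbf{Y}$ is of
rank $r \le n - p$. An estimate of $\mathrm{Var}[\hat{\mathbf{y}}(\mathbf{x})]$ is then given by

$$\widehat{\mathrm{Var}}[\hat{\mathbf{y}}(\mathbf{x})] = \mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x})\,\hat{\boldsymbol{\Sigma}}. \tag{8.69}$$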


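Let $\hat{\phi}_i$ denote the optimum value of $\hat{y}_i(\mathbf{x})$ optimized individually over the
region $R$ ($i = 1, 2, \ldots, r$). Let $\hat{\boldsymbol{\phi}} = (\hat{\phi}_1, \hat{\phi}_2, \ldots, \hat{\phi}_r)'$. These individual optima
do not in general occur at the same location in $R$. To achieve a compromise
optimum, we need to find $\mathbf{x}$ that minimizes $\rho[\hat{\mathbf{y}}(\mathbf{x}), \hat{\boldsymbol{\phi}}]$, where $\rho$ is some
metric that measures the distance of $\hat{\mathbf{y}}(\mathbf{x})$ from $\hat{\boldsymbol{\phi}}$. One possible choice for $\rho$
is the metric

$$\rho[\hat{\mathbf{y}}(\mathbf{x}), \hat{\boldsymbol{\phi}}] = \left[\left(\hat{\mathbf{y}}(\mathbf{x}) - \hat{\boldsymbol{\phi}}\right)'\left\{\widehat{\mathrm{Var}}[\hat{\mathbf{y}}(\mathbf{x})]\right\}^{-1}\left(\hat{\mathbf{y}}(\mathbf{x}) - \hat{\boldsymbol{\phi}}\right)\right]^{1/2},$$

which, by formula (8.69), can be written as

$$\rho[\hat{\mathbf{y}}(\mathbf{x}), \hat{\boldsymbol{\phi}}] = \left[\frac{\left(\hat{\mathbf{y}}(\mathbf{x}) - \hat{\boldsymbol{\phi}}\right)'\hat{\boldsymbol{\Sigma}}^{-1}\left(\hat{\mathbf{y}}(\mathbf{x}) - \hat{\boldsymbol{\phi}}\right)}{\mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x})}\right]^{1/2}. \tag{8.70}$$

We note that $\rho = 0$ if and only if $\hat{\mathbf{y}}(\mathbf{x}) = \hat{\boldsymbol{\phi}}$, that is, when all the responses
attain their individual optima at the same point; otherwise, $\rho > 0$. Such a
point (if it exists) is called a point of ideal optimum. In general, an ideal
optimum rarely exists.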
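
The metric in formula (8.70) is straightforward to evaluate once $\hat{\mathbf{y}}(\mathbf{x})$, $\hat{\boldsymbol{\phi}}$, $\hat{\boldsymbol{\Sigma}}$, and the scalar $\mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x})$ are available. The following Python sketch does this for $r = 2$ responses. The function name, the argument called `leverage`, and the numerical inputs are hypothetical and serve only to illustrate the formula.

```python
import math

def generalized_distance(y_hat, phi_hat, sigma_hat, leverage):
    """rho[y_hat(x), phi_hat] from (8.70) for r = 2 responses.

    leverage stands for the scalar f'(x) (X'X)^{-1} f(x).
    """
    (a, b), (c, d) = sigma_hat
    det = a * d - b * c
    sigma_inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [y_hat[0] - phi_hat[0], y_hat[1] - phi_hat[1]]
    quad = sum(diff[i] * sigma_inv[i][j] * diff[j]
               for i in range(2) for j in range(2))
    return math.sqrt(quad / leverage)

# Hypothetical inputs: predicted responses, individual optima, and estimates.
identity = [[1.0, 0.0], [0.0, 1.0]]
rho_a = generalized_distance((3.0, 4.0), (0.0, 0.0), identity, leverage=1.0)
rho_b = generalized_distance((3.0, 4.0), (0.0, 0.0), identity, leverage=0.25)
rho_ideal = generalized_distance((2.0, 5.0), (2.0, 5.0), identity, leverage=0.5)
print(rho_a, rho_b, rho_ideal)
```

With an identity $\hat{\boldsymbol{\Sigma}}$ the metric reduces to the Euclidean distance scaled by the prediction variance factor, so a smaller value of $\mathbf{f}'(\mathbf{x})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\mathbf{x})$ makes the same deviation count for more, and $\rho = 0$ only when the predicted responses coincide with their individual optima.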

In order to have conditions that are as close as possible to an ideal
optimum, we need to minimize $\rho$ over the region $R$. Let us suppose that the
minimum occurs at the point $\mathbf{x}_0 \in R$. Then, at $\mathbf{x}_0$ the experimental conditions
can be described as being near optimal for each of the $r$ response functions.
We therefore refer to $\mathbf{x}_0$ as a point of compromise optimum.

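Note that the elements of $\hat{\boldsymbol{\phi}}$ in formula (8.70) are random variables since
they are the individual optima of $\hat{y}_1(\mathbf{x}), \hat{y}_2(\mathbf{x}), \ldots, \hat{y}_r(\mathbf{x})$. If the variation
associated with $\hat{\boldsymbol{\phi}}$ is large, then the metric $\rho$ may not accurately measure the
deviation of $\hat{\mathbf{y}}(\mathbf{x})$ from the true ideal optimum. In this case, some account
should be taken of the randomness of $\hat{\boldsymbol{\phi}}$ in the development of the metric.
To do so, let $\boldsymbol{\zeta} = (\zeta_1, \zeta_2, \ldots, \zeta_r)'$, where $\zeta_i$ is the optimum value of the true
mean of the $i$th response optimized individually over the region $R$ ($i =
1, 2, \ldots, r$). Let $D_\zeta$ be a confidence region for $\boldsymbol{\zeta}$. For a fixed $\mathbf{x} \in R$ and
whenever $\boldsymbol{\zeta} \in D_\zeta$, we obviously have

$$\rho[\hat{\mathbf{y}}(\mathbf{x}), \boldsymbol{\zeta}] \le \max_{\boldsymbol{\eta} \in D_\zeta}\rho[\hat{\mathbf{y}}(\mathbf{x}), \boldsymbol{\eta}]. \tag{8.71}$$

The right-hand side of this inequality serves as an upper bound on $\rho[\hat{\mathbf{y}}(\mathbf{x}), \boldsymbol{\zeta}]$,
which represents the distance of $\hat{\mathbf{y}}(\mathbf{x})$ from the true ideal optimum. It follows
that

$$\min_{\mathbf{x} \in R}\rho[\hat{\mathbf{y}}(\mathbf{x}), \boldsymbol{\zeta}] \le \min_{\mathbf{x} \in R}\left\{\max_{\boldsymbol{\eta} \in D_\zeta}\rho[\hat{\mathbf{y}}(\mathbf{x}), \boldsymbol{\eta}]\right\}. \tag{8.72}$$

The right-hand side of this inequality provides a conservative measure of
distance between the compromise and ideal optima.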

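The confidence region $D_\zeta$ can be determined in a variety of ways. Khuri
and Conlon (1981) considered a rectangular confidence region of the form

$$\gamma_{1i} \le \zeta_i \le \gamma_{2i}, \qquad i = 1, 2, \ldots, r,$$

where

$$\begin{aligned}
\gamma_{1i} &= \hat{\phi}_i - g_i(\boldsymbol{\xi}_i)\,MS_i^{1/2}\,t_{\alpha/2,\,n-p}, \\
\gamma_{2i} &= \hat{\phi}_i + g_i(\boldsymbol{\xi}_i)\,MS_i^{1/2}\,t_{\alpha/2,\,n-p},
\end{aligned} \tag{8.73}$$

where $MS_i$ is the error mean square for the $i$th response, $\boldsymbol{\xi}_i$ is the point at
which $\hat{y}_i(\mathbf{x})$ attains the individual optimum $\hat{\phi}_i$, $t_{\alpha/2,\,n-p}$ is the upper $(\alpha/2) \times
100$th percentile of the $t$-distribution with $n - p$ degrees of freedom, and
$g_i(\boldsymbol{\xi}_i)$ is given by

$$g_i(\boldsymbol{\xi}_i) = \left[\mathbf{f}'(\boldsymbol{\xi}_i)(\mathbf{X}'\mathbf{X})^{-1}\mathbf{f}(\boldsymbol{\xi}_i)\right]^{1/2}, \qquad i = 1, 2, \ldots, r.$$

Khuri and Conlon (1981) showed that such a rectangular confidence region
has approximately a confidence coefficient of at least $1 - \alpha^*$, where $\alpha^* = 1
- (1 - \alpha)^r$.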

It should be noted that the evaluation of the right-hand side of inequality
(8.72) requires that $\rho[\hat{\mathbf{y}}(\mathbf{x}), \boldsymbol{\eta}]$ be maximized first with respect to $\boldsymbol{\eta}$ over $D_{\zeta}$
for a given $\mathbf{x} \in R$. The maximum value thus obtained, being a function of $\mathbf{x}$, is
then minimized over the region $R$. A computer program for the implementation
of this min-max procedure is described in Conlon and Khuri (1992).
A complete electronic copy of the code, along with examples, can be
downloaded from the Internet at ftp://ftp.stat-ufl.edu/pub/mr.tar.Z.

Numerical examples that illustrate the application of the generalized 
distance approach for multiresponse optimization can be found in Khuri and 
Conlon (1981) and Khuri and Cornell (1996, Chapter 7). 


8.8. MAXIMUM LIKELIHOOD ESTIMATION 
AND THE EM ALGORITHM 


We recall from Section 7.11.2 that the maximum likelihood (ML) estimates of
a set of parameters, $\theta_1, \theta_2, \ldots, \theta_p$, for a given distribution maximize the
likelihood function of a sample, $X_1, X_2, \ldots, X_n$, of size $n$ from the distribution.
The ML estimates of the $\theta_i$'s, denoted by $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_p$, can be found by
solving the likelihood equations (the likelihood function must be differentiable
and unimodal)

$$\frac{\partial \log[L(\mathbf{x}, \boldsymbol{\theta})]}{\partial \theta_i} = 0, \qquad i = 1, 2, \ldots, p, \tag{8.74}$$

where $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_p)'$, $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$, and $L(\mathbf{x}, \boldsymbol{\theta}) = f(\mathbf{x}, \boldsymbol{\theta})$ with
$f(\mathbf{x}, \boldsymbol{\theta})$ being the density function (or probability mass function) of
$\mathbf{X} = (X_1, X_2, \ldots, X_n)'$. Note that $f(\mathbf{x}, \boldsymbol{\theta})$ can be written as $\prod_{i=1}^{n} g(x_i, \boldsymbol{\theta})$, where
$g(x, \boldsymbol{\theta})$ is the density function (or probability mass function) associated with
the distribution.

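Equations (8.74) may not have a closed-form solution. For example,
consider the so-called truncated Poisson distribution whose probability mass
function is of the form (see Everitt, 1987, page 29)

$$g(x, \theta) = \frac{e^{-\theta}\theta^{x}}{(1 - e^{-\theta})\, x!}, \qquad x = 1, 2, \ldots. \tag{8.75}$$

In this case,

$$\begin{aligned}
\log L(\mathbf{x}, \theta) &= \log\left[\prod_{i=1}^{n} g(x_i, \theta)\right] \\
&= -n\theta + (\log\theta)\sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \log x_i! - n\log(1 - e^{-\theta}).
\end{aligned}$$

Hence,

$$\frac{\partial L^*(\mathbf{x}, \theta)}{\partial \theta} = -n + \frac{1}{\theta}\sum_{i=1}^{n} x_i - \frac{n e^{-\theta}}{1 - e^{-\theta}}, \tag{8.76}$$

where $L^*(\mathbf{x}, \theta) = \log L(\mathbf{x}, \theta)$ is the log-likelihood function. The likelihood
equation, which results from equating the right-hand side of formula (8.76) to
zero, has no closed-form solution for $\theta$.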

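In general, if equations (8.74) do not have a closed-form solution, then, as
was seen in Section 8.1, iterative methods can be applied to maximize $L(\mathbf{x}, \boldsymbol{\theta})$
[or $L^*(\mathbf{x}, \boldsymbol{\theta})$]. Using, for example, the Newton-Raphson method (see Section
8.1.2), if $\hat{\boldsymbol{\theta}}_0$ is an initial estimate of $\boldsymbol{\theta}$ and $\hat{\boldsymbol{\theta}}_i$ is the estimate at the $i$th
iteration, then by applying formula (8.7) we have

$$\hat{\boldsymbol{\theta}}_{i+1} = \hat{\boldsymbol{\theta}}_i - \mathbf{H}_{L^*}^{-1}(\mathbf{x}, \hat{\boldsymbol{\theta}}_i)\, \nabla L^*(\mathbf{x}, \hat{\boldsymbol{\theta}}_i), \qquad i = 0, 1, 2, \ldots,$$

where $\mathbf{H}_{L^*}(\mathbf{x}, \boldsymbol{\theta})$ and $\nabla L^*(\mathbf{x}, \boldsymbol{\theta})$ are, respectively, the Hessian matrix and
gradient vector of the log-likelihood function. Several iterations can be made
until a certain convergence criterion is satisfied. A modification of this
procedure is the so-called Fisher's method of scoring, where $\mathbf{H}_{L^*}$ is replaced
by its expected value, that is,

$$\hat{\boldsymbol{\theta}}_{i+1} = \hat{\boldsymbol{\theta}}_i - \{E[\mathbf{H}_{L^*}(\mathbf{x}, \hat{\boldsymbol{\theta}}_i)]\}^{-1}\, \nabla L^*(\mathbf{x}, \hat{\boldsymbol{\theta}}_i), \qquad i = 0, 1, 2, \ldots. \tag{8.77}$$

Here, the expected value is taken with respect to the given distribution.
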
EXAMPLE 8.8.1. (Everitt, 1987, pages 30-31). Consider the truncated
Poisson distribution described in formula (8.75). In this case, since we only
have one parameter $\theta$, the gradient takes the form $\nabla L^*(\mathbf{x}, \theta) = \partial L^*(\mathbf{x}, \theta)/\partial\theta$,
which is given by formula (8.76). Hence, the Hessian matrix is

$$H_{L^*}(\mathbf{x}, \theta) = \frac{\partial^2 L^*(\mathbf{x}, \theta)}{\partial \theta^2} = -\frac{1}{\theta^2}\sum_{i=1}^{n} x_i + \frac{n e^{-\theta}}{(1 - e^{-\theta})^2}.$$

Furthermore, if $X$ denotes the truncated Poisson random variable, then

$$E(X) = \sum_{x=1}^{\infty} \frac{x e^{-\theta}\theta^{x}}{(1 - e^{-\theta})\, x!} = \frac{\theta}{1 - e^{-\theta}}.$$

Thus

$$\begin{aligned}
E[H_{L^*}(\mathbf{x}, \theta)] &= E\left[\frac{\partial^2 L^*(\mathbf{x}, \theta)}{\partial \theta^2}\right] \\
&= -\frac{1}{\theta^2}\,\frac{n\theta}{1 - e^{-\theta}} + \frac{n e^{-\theta}}{(1 - e^{-\theta})^2} \\
&= \frac{n e^{-\theta}(1 + \theta) - n}{\theta (1 - e^{-\theta})^2}.
\end{aligned}$$

Suppose now we have the sample 1,2,3,4,5,6 from this distribution. Let
$\hat{\theta}_0 = 1.5118$ be an initial estimate of $\theta$. Several iterations are made by
applying formula (8.77), and the results are shown in Table 8.6. The final
estimate of $\theta$ is 0.8925, which is considered to be the maximum likelihood
estimate of $\theta$ for the given sample. The convergence criterion used here is
$|\hat{\theta}_{i+1} - \hat{\theta}_i| < 0.001$.

Table 8.6. Fisher's Method of Scoring for the Truncated Poisson Distribution

Iteration    dL*/dθ        d²L*/dθ²       θ         L*
1           -685.5137     -1176.7632     1.5118    -1545.5549
2            -62.0889     -1696.2834     0.9293    -1303.3340
3             -0.2822     -1750.5906     0.8927    -1302.1790
4              0.0012     -1750.8389     0.8925    -1302.1792

Source: Everitt (1987, page 31). Reproduced with permission of Chapman and Hall, London.
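
The scoring iteration (8.77) for the truncated Poisson case can be sketched in a few lines of Python. This is only an illustrative sketch: the function name and its arguments are hypothetical, and the values n = 2423 and sum of observations 3663 used in the call are assumptions chosen so that the sample mean equals the starting value 1.5118 of Table 8.6; they are not stated in the text.

```python
import math


def fisher_scoring_truncated_poisson(n, sum_x, theta0, tol=1e-3, max_iter=50):
    """Return the list of iterates produced by formula (8.77).

    n      : sample size (assumed input)
    sum_x  : sum of the observations (assumed input)
    theta0 : initial estimate of theta
    """
    thetas = [theta0]
    theta = theta0
    for _ in range(max_iter):
        q = math.exp(-theta)
        # Gradient of the log-likelihood, formula (8.76).
        score = -n + sum_x / theta - n * q / (1.0 - q)
        # Expected value of the second derivative of the log-likelihood.
        expected_hessian = n * (q * (1.0 + theta) - 1.0) / (theta * (1.0 - q) ** 2)
        new_theta = theta - score / expected_hessian
        thetas.append(new_theta)
        # Convergence criterion used in Example 8.8.1.
        if abs(new_theta - theta) < tol:
            break
        theta = new_theta
    return thetas


# Hypothetical summary statistics (see the assumptions stated above).
iterates = fisher_scoring_truncated_poisson(n=2423, sum_x=3663, theta0=1.5118)
print([round(t, 4) for t in iterates])
```

Only n and the sum of the observations enter the iteration, because they are the only data-dependent quantities in formula (8.76) and in the expected Hessian.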


8.8.1. The EM Algorithm 


The EM algorithm is a general iterative procedure for maximum likelihood 
estimation in incomplete data problems. This encompasses situations involv- 
ing missing data, or when the actual data are viewed as forming a subset of a 
larger system of quantities. 

The term EM was introduced by Dempster, Laird, and Rubin (1977). The 
reason for this terminology is that each iteration in this algorithm consists of 
two steps called the expectation step (E-step) and the maximization step 
(M-step). In the E-step, the conditional expectations of the missing data are 
found given the observed data and the current estimates of the parameters. 
These expected values are then substituted for the missing data and used to 
complete the data. In the M-step, maximum likelihood estimation of the 
parameters is performed in the usual manner using the completed data. 
More generally, missing sufficient statistics can be estimated rather than the 
individual missing data. The estimated parameters are then used to reesti- 
mate the missing data (or missing sufficient statistics), which in turn lead to 
new parameter estimates. This defines an iterative procedure, which can be 
carried out until convergence is achieved. 

More details concerning the theory of the EM algorithm can be found in 
Dempster, Laird, and Rubin (1977), and in Little and Rubin (1987, Chapter 
7). The following two examples, given in the latter reference, illustrate the 
application of this algorithm: 


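EXAMPLE 8.8.2. (Little and Rubin, 1987, pages 130-131). Consider a
sample of size $n$ from a normal distribution with a mean $\mu$ and a variance
$\sigma^2$. Suppose that $x_1, x_2, \ldots, x_m$ are observed data and that
$X_{m+1}, X_{m+2}, \ldots, X_n$ are missing data. Let $\mathbf{x}_{\mathrm{obs}} = (x_1, x_2, \ldots, x_m)'$. For
$i = m+1, m+2, \ldots, n$, the expected value of $X_i$ given $\mathbf{x}_{\mathrm{obs}}$ and $\boldsymbol{\theta} = (\mu, \sigma^2)'$ is $\mu$.
Now, from Example 7.11.3, the log-likelihood function for the complete data
set is

$$L^*(\mathbf{x}, \boldsymbol{\theta}) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\left(\sum_{i=1}^{n} x_i^2 - 2\mu\sum_{i=1}^{n} x_i + n\mu^2\right), \tag{8.78}$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$. We note that $\sum_{i=1}^{n} X_i^2$ and $\sum_{i=1}^{n} X_i$ are sufficient
statistics. Therefore, to apply the E-step of the algorithm, we only have to
find the conditional expectations of these statistics given $\mathbf{x}_{\mathrm{obs}}$ and the current
estimate of $\boldsymbol{\theta}$. We thus have

$$E\left(\sum_{i=1}^{n} X_i \,\Big|\, \hat{\boldsymbol{\theta}}_j, \mathbf{x}_{\mathrm{obs}}\right) = \sum_{i=1}^{m} x_i + (n - m)\hat{\mu}_j, \qquad j = 0, 1, 2, \ldots, \tag{8.79}$$

$$E\left(\sum_{i=1}^{n} X_i^2 \,\Big|\, \hat{\boldsymbol{\theta}}_j, \mathbf{x}_{\mathrm{obs}}\right) = \sum_{i=1}^{m} x_i^2 + (n - m)(\hat{\mu}_j^2 + \hat{\sigma}_j^2), \qquad j = 0, 1, 2, \ldots, \tag{8.80}$$

where $\hat{\boldsymbol{\theta}}_j = (\hat{\mu}_j, \hat{\sigma}_j^2)'$ is the estimate of $\boldsymbol{\theta}$ at the $j$th iteration with $\hat{\boldsymbol{\theta}}_0$ being an
initial estimate.

From Section 7.11 we recall that the maximum likelihood estimates of $\mu$
and $\sigma^2$ based on the complete data set are $(1/n)\sum_{i=1}^{n} x_i$ and
$(1/n)\sum_{i=1}^{n} x_i^2 - [(1/n)\sum_{i=1}^{n} x_i]^2$. Thus in the M-step, these same expressions are used, except
that the current expectations of the sufficient statistics in formulas (8.79) and
(8.80) are substituted for the missing data portion of the sufficient statistics.
In other words, the estimates of $\mu$ and $\sigma^2$ at the $(j+1)$th iteration are given
by

$$\hat{\mu}_{j+1} = \frac{1}{n}\left[\sum_{i=1}^{m} x_i + (n - m)\hat{\mu}_j\right], \qquad j = 0, 1, 2, \ldots, \tag{8.81}$$

$$\hat{\sigma}_{j+1}^2 = \frac{1}{n}\left[\sum_{i=1}^{m} x_i^2 + (n - m)(\hat{\mu}_j^2 + \hat{\sigma}_j^2)\right] - \hat{\mu}_{j+1}^2, \qquad j = 0, 1, 2, \ldots. \tag{8.82}$$

By setting $\hat{\mu}_j = \hat{\mu}_{j+1} = \hat{\mu}$ and $\hat{\sigma}_j = \hat{\sigma}_{j+1} = \hat{\sigma}$ in equations (8.81) and (8.82),
we find that the iterations converge to

$$\hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m} x_i^2 - \hat{\mu}^2,$$

which are the maximum likelihood estimates of $\mu$ and $\sigma^2$ from $\mathbf{x}_{\mathrm{obs}}$.
The EM algorithm is unnecessary in this example, since the maximum
likelihood estimates of $\mu$ and $\sigma^2$ can be obtained explicitly.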
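
The E-step and M-step of this example, formulas (8.79) through (8.82), can be sketched as the following Python loop. It is a minimal illustration, not part of the original example: the function name, the stopping tolerance, and the five observed values are all hypothetical.

```python
def em_normal_missing(x_obs, n, mu0, var0, tol=1e-10, max_iter=10000):
    """EM iterations for a normal sample of size n with only x_obs observed."""
    m = len(x_obs)
    s1 = sum(x_obs)                   # observed part of sum of X_i
    s2 = sum(v * v for v in x_obs)    # observed part of sum of X_i^2
    mu, var = mu0, var0
    for _ in range(max_iter):
        # E-step: conditional expectations of the sufficient statistics,
        # formulas (8.79) and (8.80).
        e_s1 = s1 + (n - m) * mu
        e_s2 = s2 + (n - m) * (mu * mu + var)
        # M-step: complete-data ML expressions, formulas (8.81) and (8.82).
        new_mu = e_s1 / n
        new_var = e_s2 / n - new_mu ** 2
        converged = abs(new_mu - mu) < tol and abs(new_var - var) < tol
        mu, var = new_mu, new_var
        if converged:
            break
    return mu, var


# Hypothetical data: m = 5 observed values out of a sample of size n = 8.
x_obs = [4.2, 5.1, 3.8, 6.0, 5.5]
mu_hat, var_hat = em_normal_missing(x_obs, n=8, mu0=0.0, var0=1.0)
print(mu_hat, var_hat)
```

Whatever the starting values, the iterates approach the mean and the (divisor $m$) variance of the observed values, in agreement with the limit derived in the example.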


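EXAMPLE 8.8.3. (Little and Rubin, 1987, pages 131-132). This example
was originally given in Dempster, Laird, and Rubin (1977). It involves a
multinomial $\mathbf{x} = (x_1, x_2, x_3, x_4)'$ with cell probabilities
$(\frac{1}{2} - \frac{1}{2}\theta, \frac{1}{4}\theta, \frac{1}{4}\theta, \frac{1}{2})'$,
where $0 \le \theta \le 1$. Suppose that the observed data consist of $\mathbf{x}_{\mathrm{obs}} = (38, 34, 125)'$
such that $x_1 = 38$, $x_2 = 34$, $x_3 + x_4 = 125$. The likelihood function for the
complete data is

$$L(\mathbf{x}, \theta) = \frac{(x_1 + x_2 + x_3 + x_4)!}{x_1!\, x_2!\, x_3!\, x_4!} \left(\tfrac{1}{2} - \tfrac{1}{2}\theta\right)^{x_1} \left(\tfrac{1}{4}\theta\right)^{x_2} \left(\tfrac{1}{4}\theta\right)^{x_3} \left(\tfrac{1}{2}\right)^{x_4}.$$

The log-likelihood function is of the form

$$L^*(\mathbf{x}, \theta) = \log\left[\frac{(x_1 + x_2 + x_3 + x_4)!}{x_1!\, x_2!\, x_3!\, x_4!}\right] + x_1 \log\left(\tfrac{1}{2} - \tfrac{1}{2}\theta\right) + x_2 \log\left(\tfrac{1}{4}\theta\right) + x_3 \log\left(\tfrac{1}{4}\theta\right) + x_4 \log\left(\tfrac{1}{2}\right).$$

By differentiating $L^*(\mathbf{x}, \theta)$ with respect to $\theta$ and equating the derivative to
zero, that is, $-x_1/(1 - \theta) + (x_2 + x_3)/\theta = 0$, we obtain

$$\hat{\theta} = \frac{x_2 + x_3}{x_1 + x_2 + x_3}. \tag{8.83}$$

Let us now find the conditional expectations of $X_1, X_2, X_3, X_4$ given the
observed data and the current estimate of $\theta$:

$$\begin{aligned}
E(X_1 \mid \hat{\theta}_i, \mathbf{x}_{\mathrm{obs}}) &= 38, \\
E(X_2 \mid \hat{\theta}_i, \mathbf{x}_{\mathrm{obs}}) &= 34, \\
E(X_3 \mid \hat{\theta}_i, \mathbf{x}_{\mathrm{obs}}) &= 125\,\frac{\frac{1}{4}\hat{\theta}_i}{\frac{1}{2} + \frac{1}{4}\hat{\theta}_i}, \\
E(X_4 \mid \hat{\theta}_i, \mathbf{x}_{\mathrm{obs}}) &= 125\,\frac{\frac{1}{2}}{\frac{1}{2} + \frac{1}{4}\hat{\theta}_i}.
\end{aligned}$$

Thus at the $(i+1)$st iteration we have

$$\hat{\theta}_{i+1} = \frac{34 + (125)(\frac{1}{4}\hat{\theta}_i)/(\frac{1}{2} + \frac{1}{4}\hat{\theta}_i)}{38 + 34 + (125)(\frac{1}{4}\hat{\theta}_i)/(\frac{1}{2} + \frac{1}{4}\hat{\theta}_i)}, \tag{8.84}$$

as can be seen from applying formula (8.83) using the conditional expectation
of $X_3$ instead of $x_3$. Formula (8.84) can be used iteratively to obtain the
maximum likelihood estimate of $\theta$ on the basis of the observed data. Using
an initial estimate $\hat{\theta}_0 = \frac{1}{2}$, the results of this iterative procedure are given in
Table 8.7. Note that if we set $\hat{\theta}_{i+1} = \hat{\theta}_i = \hat{\theta}$ in formula (8.84) we obtain the
quadratic equation,

$$197\hat{\theta}^2 - 15\hat{\theta} - 68 = 0,$$

whose only positive root is $\hat{\theta} = 0.626821498$, which is very close to the value
obtained in the last iteration in Table 8.7.

Table 8.7. The EM Algorithm for Example 8.8.3

Iteration    θ
0            0.500000000
1            0.608247423
2            0.624321051
3            0.626488879
4            0.626777323
5            0.626815632
6            0.626820719
7            0.626821395
8            0.626821484

Source: Little and Rubin (1987, page 132). Reproduced
with permission of John Wiley & Sons, Inc.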
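
The recursion (8.84) is short enough to run directly. The Python sketch below (the function name and the choice of eight iterations are illustrative assumptions) performs the E-step for the unobserved count $X_3$ and then the M-step of formula (8.83), and compares the last iterate with the positive root of the quadratic equation above.

```python
import math


def em_multinomial(theta0=0.5, n_iter=8):
    """Iterate formula (8.84) for the observed counts (38, 34, 125)."""
    thetas = [theta0]
    for _ in range(n_iter):
        theta = thetas[-1]
        # E-step: expected count in cell 3, given x3 + x4 = 125.
        e_x3 = 125.0 * (theta / 4.0) / (0.5 + theta / 4.0)
        # M-step: formula (8.83) with x3 replaced by its expectation.
        thetas.append((34.0 + e_x3) / (38.0 + 34.0 + e_x3))
    return thetas


thetas = em_multinomial()
# Positive root of 197*theta^2 - 15*theta - 68 = 0.
root = (15.0 + math.sqrt(15.0 ** 2 + 4.0 * 197.0 * 68.0)) / (2.0 * 197.0)
for i, t in enumerate(thetas):
    print(i, f"{t:.9f}")
print("fixed point:", f"{root:.9f}")
```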


8.9. MINIMUM NORM QUADRATIC UNBIASED ESTIMATION
OF VARIANCE COMPONENTS


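Consider the linear model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\alpha} + \sum_{i=1}^{c} \mathbf{U}_i \boldsymbol{\beta}_i, \tag{8.85}$$

where $\mathbf{y}$ is a vector of $n$ observations; $\boldsymbol{\alpha}$ is a vector of fixed effects;
$\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_c$ are vectors of random effects; $\mathbf{X}, \mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_c$ are known
matrices of constants with $\boldsymbol{\beta}_c = \boldsymbol{\epsilon}$, the vector of random errors; and $\mathbf{U}_c = \mathbf{I}_n$.
We assume that the $\boldsymbol{\beta}_i$'s are uncorrelated with zero mean vectors and
variance-covariance matrices $\sigma_i^2 \mathbf{I}_{m_i}$, where $m_i$ is the number of columns of
$\mathbf{U}_i$ ($i = 1, 2, \ldots, c$). The variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_c^2$ are referred to as variance
components. Model (8.85) can be written as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\alpha} + \mathbf{U}\boldsymbol{\beta}, \tag{8.86}$$

where $\mathbf{U} = [\mathbf{U}_1 : \mathbf{U}_2 : \cdots : \mathbf{U}_c]$, $\boldsymbol{\beta} = (\boldsymbol{\beta}_1', \boldsymbol{\beta}_2', \ldots, \boldsymbol{\beta}_c')'$. From model (8.86) we have

$$E(\mathbf{y}) = \mathbf{X}\boldsymbol{\alpha}, \qquad \mathrm{Var}(\mathbf{y}) = \sum_{i=1}^{c} \sigma_i^2 \mathbf{V}_i, \tag{8.87}$$

with $\mathbf{V}_i = \mathbf{U}_i \mathbf{U}_i'$.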

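Let us consider the estimation of a linear function of the variance
components, namely, $\sum_{i=1}^{c} a_i \sigma_i^2$, where the $a_i$'s are known constants, by a
quadratic estimator of the form $\mathbf{y}'\mathbf{A}\mathbf{y}$. Here, $\mathbf{A}$ is a symmetric matrix to be
determined so that $\mathbf{y}'\mathbf{A}\mathbf{y}$ satisfies certain criteria, which are the following:

1. Translation Invariance. If instead of $\boldsymbol{\alpha}$ we consider $\boldsymbol{\gamma} = \boldsymbol{\alpha} - \boldsymbol{\alpha}_0$, then
   from model (8.86) we have

   $$\mathbf{y} - \mathbf{X}\boldsymbol{\alpha}_0 = \mathbf{X}\boldsymbol{\gamma} + \mathbf{U}\boldsymbol{\beta}.$$

   In this case, $\sum_{i=1}^{c} a_i \sigma_i^2$ is estimated by $(\mathbf{y} - \mathbf{X}\boldsymbol{\alpha}_0)'\mathbf{A}(\mathbf{y} - \mathbf{X}\boldsymbol{\alpha}_0)$. The
   estimator $\mathbf{y}'\mathbf{A}\mathbf{y}$ is said to be translation invariant if

   $$\mathbf{y}'\mathbf{A}\mathbf{y} = (\mathbf{y} - \mathbf{X}\boldsymbol{\alpha}_0)'\mathbf{A}(\mathbf{y} - \mathbf{X}\boldsymbol{\alpha}_0).$$

   In order for this to be true we must have

   $$\mathbf{A}\mathbf{X} = \mathbf{0}. \tag{8.88}$$

2. Unbiasedness. $E(\mathbf{y}'\mathbf{A}\mathbf{y}) = \sum_{i=1}^{c} a_i \sigma_i^2$. Using a result in Searle (1971,
   Theorem 1, page 55), the expected value of the quadratic form $\mathbf{y}'\mathbf{A}\mathbf{y}$ is given
   by

   $$E(\mathbf{y}'\mathbf{A}\mathbf{y}) = \boldsymbol{\alpha}'\mathbf{X}'\mathbf{A}\mathbf{X}\boldsymbol{\alpha} + \mathrm{tr}[\mathbf{A}\,\mathrm{Var}(\mathbf{y})], \tag{8.89}$$

   since $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\alpha}$. From formulas (8.87), (8.88), and (8.89) we then have

   $$E(\mathbf{y}'\mathbf{A}\mathbf{y}) = \sum_{i=1}^{c} \sigma_i^2\, \mathrm{tr}(\mathbf{A}\mathbf{V}_i). \tag{8.90}$$

   By comparison with $\sum_{i=1}^{c} a_i \sigma_i^2$, the condition for unbiasedness is

   $$a_i = \mathrm{tr}(\mathbf{A}\mathbf{V}_i), \qquad i = 1, 2, \ldots, c. \tag{8.91}$$

3. Minimum Norm. If $\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_c$ in model (8.85) were observable, then
   a natural unbiased estimator of $\sum_{i=1}^{c} a_i \sigma_i^2$ would be $\sum_{i=1}^{c} a_i \boldsymbol{\beta}_i'\boldsymbol{\beta}_i / m_i$,
   since $E(\boldsymbol{\beta}_i'\boldsymbol{\beta}_i) = \mathrm{tr}(\mathbf{I}_{m_i}\sigma_i^2) = m_i \sigma_i^2$, $i = 1, 2, \ldots, c$. This estimator can be
   written as $\boldsymbol{\beta}'\boldsymbol{\Delta}\boldsymbol{\beta}$, where $\boldsymbol{\Delta}$ is the block-diagonal matrix

   $$\boldsymbol{\Delta} = \mathrm{Diag}\left(\frac{a_1}{m_1}\mathbf{I}_{m_1}, \frac{a_2}{m_2}\mathbf{I}_{m_2}, \ldots, \frac{a_c}{m_c}\mathbf{I}_{m_c}\right).$$

   The difference between this estimator and $\mathbf{y}'\mathbf{A}\mathbf{y}$ is

   $$\mathbf{y}'\mathbf{A}\mathbf{y} - \boldsymbol{\beta}'\boldsymbol{\Delta}\boldsymbol{\beta} = \boldsymbol{\beta}'(\mathbf{U}'\mathbf{A}\mathbf{U} - \boldsymbol{\Delta})\boldsymbol{\beta},$$

   since $\mathbf{A}\mathbf{X} = \mathbf{0}$. This difference can be made small by minimizing the
   Euclidean norm $\|\mathbf{U}'\mathbf{A}\mathbf{U} - \boldsymbol{\Delta}\|_2$.

   The quadratic estimator $\mathbf{y}'\mathbf{A}\mathbf{y}$ is said to be a minimum norm quadratic
   unbiased estimator (MINQUE) of $\sum_{i=1}^{c} a_i \sigma_i^2$ if the matrix $\mathbf{A}$ is determined
   so that $\|\mathbf{U}'\mathbf{A}\mathbf{U} - \boldsymbol{\Delta}\|_2$ attains a minimum subject to the conditions
   given in formulas (8.88) and (8.91). Such an estimator was introduced
   by Rao (1971, 1972).

   The minimization of $\|\mathbf{U}'\mathbf{A}\mathbf{U} - \boldsymbol{\Delta}\|_2$ is equivalent to that of $\mathrm{tr}(\mathbf{A}\mathbf{V}\mathbf{A}\mathbf{V})$,
   where $\mathbf{V} = \sum_{i=1}^{c} \mathbf{V}_i$. The reason for this is the following:

   $$\begin{aligned}
   \|\mathbf{U}'\mathbf{A}\mathbf{U} - \boldsymbol{\Delta}\|_2^2 &= \mathrm{tr}[(\mathbf{U}'\mathbf{A}\mathbf{U} - \boldsymbol{\Delta})(\mathbf{U}'\mathbf{A}\mathbf{U} - \boldsymbol{\Delta})] \\
   &= \mathrm{tr}(\mathbf{U}'\mathbf{A}\mathbf{U}\mathbf{U}'\mathbf{A}\mathbf{U}) - 2\,\mathrm{tr}(\mathbf{U}'\mathbf{A}\mathbf{U}\boldsymbol{\Delta}) + \mathrm{tr}(\boldsymbol{\Delta}^2).
   \end{aligned} \tag{8.92}$$

   Now,

   $$\begin{aligned}
   \mathrm{tr}(\mathbf{U}'\mathbf{A}\mathbf{U}\boldsymbol{\Delta}) &= \mathrm{tr}(\mathbf{A}\mathbf{U}\boldsymbol{\Delta}\mathbf{U}') \\
   &= \mathrm{tr}\left[\mathbf{A}\sum_{i=1}^{c} \frac{a_i}{m_i}\mathbf{U}_i\mathbf{U}_i'\right] \\
   &= \sum_{i=1}^{c} \frac{a_i}{m_i}\,\mathrm{tr}(\mathbf{A}\mathbf{V}_i) \\
   &= \sum_{i=1}^{c} \frac{a_i^2}{m_i}, \qquad \text{by (8.91)} \\
   &= \mathrm{tr}(\boldsymbol{\Delta}^2).
   \end{aligned}$$
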
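   Formula (8.92) can then be written as

   $$\begin{aligned}
   \|\mathbf{U}'\mathbf{A}\mathbf{U} - \boldsymbol{\Delta}\|_2^2 &= \mathrm{tr}(\mathbf{U}'\mathbf{A}\mathbf{U}\mathbf{U}'\mathbf{A}\mathbf{U}) - \mathrm{tr}(\boldsymbol{\Delta}^2) \\
   &= \mathrm{tr}(\mathbf{A}\mathbf{V}\mathbf{A}\mathbf{V}) - \mathrm{tr}(\boldsymbol{\Delta}^2),
   \end{aligned}$$

   since $\mathbf{V} = \sum_{i=1}^{c} \mathbf{V}_i = \sum_{i=1}^{c} \mathbf{U}_i\mathbf{U}_i' = \mathbf{U}\mathbf{U}'$. The trace of $\boldsymbol{\Delta}^2$ does not involve $\mathbf{A}$;
   hence the problem of MINQUE reduces to finding $\mathbf{A}$ that minimizes
   $\mathrm{tr}(\mathbf{A}\mathbf{V}\mathbf{A}\mathbf{V})$ subject to conditions (8.88) and (8.91). Rao (1971) showed
   that the solution to this optimization problem is of the form

   $$\mathbf{A} = \sum_{i=1}^{c} \lambda_i \mathbf{R}\mathbf{V}_i\mathbf{R}, \tag{8.93}$$

   where

   $$\mathbf{R} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-}\mathbf{X}'\mathbf{V}^{-1}$$

   with $(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-}$ being a generalized inverse of $\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}$, and the $\lambda_i$'s
   are obtained from solving the equations

   $$\sum_{i=1}^{c} \lambda_i\, \mathrm{tr}(\mathbf{R}\mathbf{V}_i\mathbf{R}\mathbf{V}_j) = a_j, \qquad j = 1, 2, \ldots, c,$$

   which can be expressed as

   $$\boldsymbol{\lambda}'\mathbf{S} = \mathbf{a}', \tag{8.94}$$

   where $\boldsymbol{\lambda} = (\lambda_1, \lambda_2, \ldots, \lambda_c)'$, $\mathbf{S}$ is the $c \times c$ matrix $(s_{ij})$ with
   $s_{ij} = \mathrm{tr}(\mathbf{R}\mathbf{V}_i\mathbf{R}\mathbf{V}_j)$, and $\mathbf{a} = (a_1, a_2, \ldots, a_c)'$. The MINQUE of $\sum_{i=1}^{c} a_i \sigma_i^2$ can
   then be written as

   $$\mathbf{y}'\left[\sum_{i=1}^{c} \lambda_i \mathbf{R}\mathbf{V}_i\mathbf{R}\right]\mathbf{y} = \sum_{i=1}^{c} \lambda_i\, \mathbf{y}'\mathbf{R}\mathbf{V}_i\mathbf{R}\mathbf{y} = \boldsymbol{\lambda}'\mathbf{q},$$

   where $\mathbf{q} = (q_1, q_2, \ldots, q_c)'$ with $q_i = \mathbf{y}'\mathbf{R}\mathbf{V}_i\mathbf{R}\mathbf{y}$ ($i = 1, 2, \ldots, c$). But, from
   formula (8.94), $\boldsymbol{\lambda}' = \mathbf{a}'\mathbf{S}^{-}$, where $\mathbf{S}^{-}$ is a generalized inverse of $\mathbf{S}$.
   Hence, $\boldsymbol{\lambda}'\mathbf{q} = \mathbf{a}'\mathbf{S}^{-}\mathbf{q} = \mathbf{a}'\hat{\boldsymbol{\sigma}}$, where $\hat{\boldsymbol{\sigma}} = (\hat{\sigma}_1^2, \hat{\sigma}_2^2, \ldots, \hat{\sigma}_c^2)'$ is a solution
   of the equation

   $$\mathbf{S}\hat{\boldsymbol{\sigma}} = \mathbf{q}. \tag{8.95}$$

   This equation has a unique solution if and only if the individual
   variance components are unbiasedly estimable (see Rao, 1972, page
   114). Thus the MINQUEs of the $\sigma_i^2$'s are obtained from solving
   equation (8.95).

   If the random effects in model (8.85) are assumed to be normally
   distributed, then the MINQUEs of the variance components reduce to
   the so-called minimum variance quadratic unbiased estimators
   (MIVQUEs). An example that shows how to compute these estimators
   in the case of a random one-way classification model is given in
   Swallow and Searle (1978). See also Milliken and Johnson (1984,
   Chapter 19).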


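8.10. SCHEFFÉ'S CONFIDENCE INTERVALS

Consider the linear model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \tag{8.96}$$

where $\mathbf{y}$ is a vector of $n$ observations, $\mathbf{X}$ is a known matrix of order $n \times p$ and
rank $r$ ($\le p$), $\boldsymbol{\beta}$ is a vector of unknown parameters, and $\boldsymbol{\epsilon}$ is a random error
vector. It is assumed that $\boldsymbol{\epsilon}$ has the normal distribution with a mean $\mathbf{0}$ and a
variance-covariance matrix $\sigma^2\mathbf{I}_n$. Let $\psi = \mathbf{a}'\boldsymbol{\beta}$ be an estimable linear function
of the elements of $\boldsymbol{\beta}$. By this we mean that there exists a linear function
$\mathbf{t}'\mathbf{y}$ of $\mathbf{y}$ such that $E(\mathbf{t}'\mathbf{y}) = \psi$, where $\mathbf{t}$ is some constant vector. A necessary
and sufficient condition for $\psi$ to be estimable is that $\mathbf{a}'$ belongs to the row
space of $\mathbf{X}$, that is, $\mathbf{a}'$ is a linear combination of the rows of $\mathbf{X}$ (see, for
example, Searle, 1971, page 181). Since the rank of $\mathbf{X}$ is $r$, the row space of $\mathbf{X}$,
denoted by $\rho(\mathbf{X})$, is an $r$-dimensional subspace of the $p$-dimensional
Euclidean space $R^p$. Thus $\mathbf{a}'\boldsymbol{\beta}$ is estimable if and only if $\mathbf{a}' \in \rho(\mathbf{X})$.

Suppose that $\mathbf{a}'$ is an arbitrary vector in a $q$-dimensional subspace $\mathcal{L}$ of
$\rho(\mathbf{X})$, where $q \le r$. Then $\mathbf{a}'\hat{\boldsymbol{\beta}} = \mathbf{a}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$ is the best linear unbiased
estimator of $\mathbf{a}'\boldsymbol{\beta}$, and its variance is given by

$$\mathrm{Var}(\mathbf{a}'\hat{\boldsymbol{\beta}}) = \sigma^2 \mathbf{a}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{a},$$

where $(\mathbf{X}'\mathbf{X})^{-}$ is a generalized inverse of $\mathbf{X}'\mathbf{X}$ (see, for example, Searle, 1971,
pages 181-182). Both $\mathbf{a}'\hat{\boldsymbol{\beta}}$ and $\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{a}$ are invariant to the choice of
$(\mathbf{X}'\mathbf{X})^{-}$, since $\mathbf{a}'\boldsymbol{\beta}$ is estimable (see, for example, Searle, 1971, page 181). In
particular, if $r = p$, then $\mathbf{X}'\mathbf{X}$ is of full rank and $(\mathbf{X}'\mathbf{X})^{-} = (\mathbf{X}'\mathbf{X})^{-1}$.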


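Theorem 8.10.1. Simultaneous $(1 - \alpha)100\%$ confidence intervals on $\mathbf{a}'\boldsymbol{\beta}$
for all $\mathbf{a}' \in \mathcal{L}$, where $\mathcal{L}$ is a $q$-dimensional subspace of $\rho(\mathbf{X})$, are of the form

$$\mathbf{a}'\hat{\boldsymbol{\beta}} \mp (q\, MS_E\, F_{\alpha, q, n-r})^{1/2}\, [\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{a}]^{1/2}, \tag{8.97}$$

where $F_{\alpha, q, n-r}$ is the upper $\alpha 100$th percentile of the $F$-distribution with $q$
and $n - r$ degrees of freedom, and $MS_E$ is the error mean square given by

$$MS_E = \frac{1}{n - r}\, \mathbf{y}'[\mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}']\mathbf{y}. \tag{8.98}$$

In Theorem 8.10.1, the word "simultaneous" means that with probability
$1 - \alpha$, the values of $\mathbf{a}'\boldsymbol{\beta}$ for all $\mathbf{a}' \in \mathcal{L}$ satisfy the double inequality

$$\mathbf{a}'\hat{\boldsymbol{\beta}} - (q\, MS_E\, F_{\alpha, q, n-r})^{1/2} [\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{a}]^{1/2} \le \mathbf{a}'\boldsymbol{\beta} \le \mathbf{a}'\hat{\boldsymbol{\beta}} + (q\, MS_E\, F_{\alpha, q, n-r})^{1/2} [\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{a}]^{1/2}. \tag{8.99}$$

A proof of this theorem is given in Scheffé (1959, Section 3.5). Another proof
is presented here using the method of Lagrange multipliers. This proof is
based on the following lemma: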


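Lemma 8.10.1. Let $C$ be the set $\{\mathbf{x} \in R^q \mid \mathbf{x}'\mathbf{A}\mathbf{x} \le 1\}$, where $\mathbf{A}$ is a positive
definite matrix of order $q \times q$. Then $\mathbf{x} \in C$ if and only if $|\mathbf{l}'\mathbf{x}| \le (\mathbf{l}'\mathbf{A}^{-1}\mathbf{l})^{1/2}$
for all $\mathbf{l} \in R^q$.

Proof. Suppose that $\mathbf{x} \in C$. Since $\mathbf{A}$ is positive definite, the boundary of $C$
is an ellipsoid in a $q$-dimensional space. For any $\mathbf{l} \in R^q$, let $\mathbf{e}$ be a unit vector
in its direction. The projection of $\mathbf{x}$ on an axis in the direction of $\mathbf{l}$ is given by
$\mathbf{e}'\mathbf{x}$. Consider optimizing $\mathbf{e}'\mathbf{x}$ with respect to $\mathbf{x}$ over the set $C$. The minimum
and maximum values of $\mathbf{e}'\mathbf{x}$ are obviously determined by the end points of the
projection of $C$ on the $\mathbf{l}$-axis. This is equivalent to optimizing $\mathbf{e}'\mathbf{x}$ subject to
the constraint $\mathbf{x}'\mathbf{A}\mathbf{x} = 1$, since the projection of $C$ on the $\mathbf{l}$-axis is the same as
the projection of its boundary, the ellipsoid $\mathbf{x}'\mathbf{A}\mathbf{x} = 1$. This constrained
optimization problem can be solved by using the method of Lagrange
multipliers.

Let $G = \mathbf{e}'\mathbf{x} + \lambda(\mathbf{x}'\mathbf{A}\mathbf{x} - 1)$, where $\lambda$ is a Lagrange multiplier. By differentiating
$G$ with respect to $x_1, x_2, \ldots, x_q$, where $x_i$ is the $i$th element of $\mathbf{x}$
($i = 1, 2, \ldots, q$), and equating the derivatives to zero, we obtain the equation
$\mathbf{e} + 2\lambda\mathbf{A}\mathbf{x} = \mathbf{0}$, whose solution is $\mathbf{x} = -(1/2\lambda)\mathbf{A}^{-1}\mathbf{e}$. If we substitute this value
of $\mathbf{x}$ into the equation $\mathbf{x}'\mathbf{A}\mathbf{x} = 1$ and then solve for $\lambda$, we obtain the two
solutions $\lambda_1 = -\frac{1}{2}(\mathbf{e}'\mathbf{A}^{-1}\mathbf{e})^{1/2}$, $\lambda_2 = \frac{1}{2}(\mathbf{e}'\mathbf{A}^{-1}\mathbf{e})^{1/2}$. But, $\mathbf{e}'\mathbf{x} = -2\lambda$, since
$\mathbf{x}'\mathbf{A}\mathbf{x} = 1$. It follows that the minimum and maximum values of $\mathbf{e}'\mathbf{x}$ under the
constraint $\mathbf{x}'\mathbf{A}\mathbf{x} = 1$ are $-(\mathbf{e}'\mathbf{A}^{-1}\mathbf{e})^{1/2}$ and $(\mathbf{e}'\mathbf{A}^{-1}\mathbf{e})^{1/2}$, respectively. Hence,

$$|\mathbf{e}'\mathbf{x}| \le (\mathbf{e}'\mathbf{A}^{-1}\mathbf{e})^{1/2}. \tag{8.100}$$

Since $\mathbf{l} = \|\mathbf{l}\|_2\, \mathbf{e}$, where $\|\mathbf{l}\|_2$ is the Euclidean norm of $\mathbf{l}$, multiplying the two
sides of inequality (8.100) by $\|\mathbf{l}\|_2$ yields

$$|\mathbf{l}'\mathbf{x}| \le (\mathbf{l}'\mathbf{A}^{-1}\mathbf{l})^{1/2}. \tag{8.101}$$

Vice versa, if inequality (8.101) is true for all $\mathbf{l} \in R^q$, then by choosing
$\mathbf{l}' = \mathbf{x}'\mathbf{A}$ we obtain

$$|\mathbf{x}'\mathbf{A}\mathbf{x}| \le (\mathbf{x}'\mathbf{A}\mathbf{A}^{-1}\mathbf{A}\mathbf{x})^{1/2},$$

which is equivalent to $\mathbf{x}'\mathbf{A}\mathbf{x} \le 1$, that is, $\mathbf{x} \in C$. □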


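Proof of Theorem 8.10.1. Let $\mathbf{L}$ be a $q \times p$ matrix of rank $q$ whose rows
form a basis for the $q$-dimensional subspace $\mathcal{L}$ of $\rho(\mathbf{X})$. Since $\mathbf{y}$ in model
(8.96) is distributed as $N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I}_n)$, $\mathbf{L}\hat{\boldsymbol{\beta}} = \mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y}$ is distributed as
$N[\mathbf{L}\boldsymbol{\beta}, \sigma^2\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}']$. Thus the random variable

$$F = \frac{[\mathbf{L}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})]'[\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}']^{-1}[\mathbf{L}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})]}{q\, MS_E}$$

has the $F$-distribution with $q$ and $n - r$ degrees of freedom (see, for example,
Searle, 1971, page 190). It follows that

$$P(F \le F_{\alpha, q, n-r}) = 1 - \alpha. \tag{8.102}$$

By applying Lemma 8.10.1 to formula (8.102) with $\mathbf{x} = \mathbf{L}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ and
$\mathbf{A} = [\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}']^{-1}/(q\, MS_E\, F_{\alpha, q, n-r})$ we obtain the equivalent probability
statement

$$P\left\{ |\mathbf{l}'\mathbf{L}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})| \le (q\, MS_E\, F_{\alpha, q, n-r})^{1/2} [\mathbf{l}'\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}'\mathbf{l}]^{1/2} \ \ \forall\, \mathbf{l} \in R^q \right\} = 1 - \alpha.$$

Let $\mathbf{a}' = \mathbf{l}'\mathbf{L}$. We then have

$$P\left\{ |\mathbf{a}'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})| \le (q\, MS_E\, F_{\alpha, q, n-r})^{1/2} [\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{a}]^{1/2} \ \ \forall\, \mathbf{a}' \in \mathcal{L} \right\} = 1 - \alpha.$$

We conclude that the values of $\mathbf{a}'\boldsymbol{\beta}$ satisfy the double inequality (8.99) for all
$\mathbf{a}' \in \mathcal{L}$ with probability $1 - \alpha$. Simultaneous $(1 - \alpha)100\%$ confidence intervals
on $\mathbf{a}'\boldsymbol{\beta}$ are therefore given by formula (8.97). We refer to these intervals
as Scheffé's confidence intervals.

Theorem 8.10.1 can be used to obtain simultaneous confidence intervals
on all contrasts among the elements of $\boldsymbol{\beta}$. By definition, the linear function
$\mathbf{a}'\boldsymbol{\beta}$ is a contrast among the elements of $\boldsymbol{\beta}$ if $\sum_{i=1}^{p} a_i = 0$, where $a_i$ is the $i$th
element of $\mathbf{a}$ ($i = 1, 2, \ldots, p$). If $\mathbf{a}'$ is in the row space of $\mathbf{X}$, then it must
belong to a $q$-dimensional subspace of $\rho(\mathbf{X})$, where $q = r - 1$. Hence,
simultaneous $(1 - \alpha)100\%$ confidence intervals on all such contrasts can be
obtained from formula (8.97) by replacing $q$ with $r - 1$. □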


8.10.1. The Relation of Scheffé’s Confidence Intervals to the F-Test 


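There is a relationship between the confidence intervals (8.97) and the $F$-test
used to test the hypothesis $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{0}$ versus $H_a: \mathbf{L}\boldsymbol{\beta} \ne \mathbf{0}$, where $\mathbf{L}$ is the
matrix whose rows form a basis for the $q$-dimensional subspace $\mathcal{L}$ of $\rho(\mathbf{X})$.
The test statistic for testing $H_0$ is given by (see Searle, 1971, Section 5.5)

$$F = \frac{\hat{\boldsymbol{\beta}}'\mathbf{L}'[\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}']^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}}{q\, MS_E},$$

which under $H_0$ has the $F$-distribution with $q$ and $n - r$ degrees of freedom.
The hypothesis $H_0$ can be rejected at the $\alpha$-level of significance if
$F > F_{\alpha, q, n-r}$. In this case, by Lemma 8.10.1, there exists at least one $\mathbf{l} \in R^q$ such
that

$$|\mathbf{l}'\mathbf{L}\hat{\boldsymbol{\beta}}| > (q\, MS_E\, F_{\alpha, q, n-r})^{1/2} [\mathbf{l}'\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}'\mathbf{l}]^{1/2}. \tag{8.103}$$

It follows that the $F$-test rejects $H_0$ if and only if there exists a linear
combination $\mathbf{a}'\hat{\boldsymbol{\beta}}$, where $\mathbf{a}' = \mathbf{l}'\mathbf{L}$ for some $\mathbf{l} \in R^q$, for which the confidence
interval in formula (8.97) does not contain the value zero. In this case, $\mathbf{a}'\hat{\boldsymbol{\beta}}$ is
said to be significantly different from zero.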


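It is easy to see that inequality (8.103) holds for some $\mathbf{l} \in R^q$ if and only if

$$\sup_{\mathbf{l} \ne \mathbf{0}} \frac{|\mathbf{l}'\mathbf{L}\hat{\boldsymbol{\beta}}|^2}{\mathbf{l}'\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}'\mathbf{l}} > q\, MS_E\, F_{\alpha, q, n-r},$$

or equivalently,

$$\sup_{\mathbf{l} \ne \mathbf{0}} \frac{\mathbf{l}'\mathbf{G}_1\mathbf{l}}{\mathbf{l}'\mathbf{G}_2\mathbf{l}} > q\, MS_E\, F_{\alpha, q, n-r}, \tag{8.104}$$

where

$$\mathbf{G}_1 = \mathbf{L}\hat{\boldsymbol{\beta}}\hat{\boldsymbol{\beta}}'\mathbf{L}', \tag{8.105}$$

$$\mathbf{G}_2 = \mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}'. \tag{8.106}$$

However, by Theorem 2.3.17,

$$\sup_{\mathbf{l} \ne \mathbf{0}} \frac{\mathbf{l}'\mathbf{G}_1\mathbf{l}}{\mathbf{l}'\mathbf{G}_2\mathbf{l}} = e_{\max}(\mathbf{G}_2^{-1}\mathbf{G}_1) = \hat{\boldsymbol{\beta}}'\mathbf{L}'[\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}']^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}, \tag{8.107}$$

where $e_{\max}(\mathbf{G}_2^{-1}\mathbf{G}_1)$ is the largest eigenvalue of $\mathbf{G}_2^{-1}\mathbf{G}_1$. The second equality
in (8.107) is true because the nonzero eigenvalues of $[\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}']^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}\hat{\boldsymbol{\beta}}'\mathbf{L}'$
are the same as those of $\hat{\boldsymbol{\beta}}'\mathbf{L}'[\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}']^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}$ by Theorem 2.3.9. Note that
the latter expression is the numerator sum of squares of the $F$-test statistic
for $H_0$.

The eigenvector of $\mathbf{G}_2^{-1}\mathbf{G}_1$ corresponding to $e_{\max}(\mathbf{G}_2^{-1}\mathbf{G}_1)$ is of special
interest. Let $\mathbf{l}^*$ be such an eigenvector. Then

$$\frac{\mathbf{l}^{*\prime}\mathbf{G}_1\mathbf{l}^*}{\mathbf{l}^{*\prime}\mathbf{G}_2\mathbf{l}^*} = e_{\max}(\mathbf{G}_2^{-1}\mathbf{G}_1). \tag{8.108}$$

This follows from the fact that $\mathbf{l}^*$ satisfies the equation

$$(\mathbf{G}_1 - e_{\max}\mathbf{G}_2)\mathbf{l}^* = \mathbf{0},$$

where $e_{\max}$ is an abbreviation for $e_{\max}(\mathbf{G}_2^{-1}\mathbf{G}_1)$. It is easy to see that $\mathbf{l}^*$ can be
chosen to be the vector $\mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}$, since

$$\begin{aligned}
\mathbf{G}_2^{-1}\mathbf{G}_1(\mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}) &= \mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}\hat{\boldsymbol{\beta}}'\mathbf{L}'(\mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}) \\
&= (\hat{\boldsymbol{\beta}}'\mathbf{L}'\mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}})\,\mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}} \\
&= e_{\max}\mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}.
\end{aligned}$$

This shows that $\mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}$ is an eigenvector of $\mathbf{G}_2^{-1}\mathbf{G}_1$ for the eigenvalue $e_{\max}$.
From inequality (8.104) and formula (8.108) we conclude that if the $F$-test
rejects $H_0$ at the $\alpha$-level, then

$$|\mathbf{l}^{*\prime}\mathbf{L}\hat{\boldsymbol{\beta}}| > (q\, MS_E\, F_{\alpha, q, n-r})^{1/2} [\mathbf{l}^{*\prime}\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}'\mathbf{l}^*]^{1/2}.$$

This means that the linear combination $\mathbf{a}^{*\prime}\hat{\boldsymbol{\beta}}$, where $\mathbf{a}^{*\prime} = \mathbf{l}^{*\prime}\mathbf{L}$, is significantly
different from zero. Let us express $\mathbf{a}^{*\prime}\hat{\boldsymbol{\beta}}$ as

$$\mathbf{l}^{*\prime}\mathbf{L}\hat{\boldsymbol{\beta}} = \sum_{i=1}^{q} l_i^* \hat{\gamma}_i, \tag{8.109}$$

where $l_i^*$ and $\hat{\gamma}_i$ are the $i$th elements of $\mathbf{l}^*$ and $\hat{\boldsymbol{\gamma}} = \mathbf{L}\hat{\boldsymbol{\beta}}$, respectively
($i = 1, 2, \ldots, q$). If we divide $\hat{\gamma}_i$ by its estimated standard error $\hat{\kappa}_i$ [which is
equal to the square root of the $i$th diagonal element of the variance-covariance
matrix of $\mathbf{L}\hat{\boldsymbol{\beta}}$, namely, $\sigma^2\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}'$ with $\sigma^2$ replaced by the error
mean square $MS_E$ in formula (8.98)], then formula (8.109) can be written as

$$\mathbf{l}^{*\prime}\mathbf{L}\hat{\boldsymbol{\beta}} = \sum_{i=1}^{q} l_i^* \hat{\kappa}_i \tau_i, \tag{8.110}$$

where $\tau_i = \hat{\gamma}_i/\hat{\kappa}_i$, $i = 1, 2, \ldots, q$. Consequently, large values of $|l_i^*|\hat{\kappa}_i$ identify
those elements of $\hat{\boldsymbol{\gamma}}$ that are influential contributors to the significance of
the $F$-test concerning $H_0$. Note that the elements of $\boldsymbol{\gamma} = \mathbf{L}\boldsymbol{\beta}$ form a set of
linearly independent estimable linear functions of $\boldsymbol{\beta}$.

We conclude from the previous arguments that the eigenvector $\mathbf{l}^*$, which
corresponds to the largest eigenvalue of $\mathbf{G}_2^{-1}\mathbf{G}_1$, can be conveniently used to
identify an estimable linear function of $\boldsymbol{\beta}$ that is significantly different from
zero whenever the $F$-test rejects $H_0$.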

It should be noted that if model (8.96) is a response surface model (in this
case, the matrix $\mathbf{X}$ in the model is of full column rank, that is, $r = p$) whose
input variables, $x_1, x_2, \ldots, x_k$, have different units of measurement, then
these variables must be made scale free. This is accomplished as follows: If
$x_{ui}$ denotes the $u$th measurement on $x_i$, then we may consider the
transformation

$$x_{ui} \mapsto \frac{x_{ui} - \bar{x}_i}{s_i},$$

where $\bar{x}_i = (1/n)\sum_{u=1}^{n} x_{ui}$, $s_i = [\sum_{u=1}^{n} (x_{ui} - \bar{x}_i)^2]^{1/2}$, and $n$ is the total number
of observations. One advantage of this scaling convention, besides making the
input variables scale free, is that it can greatly improve the conditioning of
the matrix $\mathbf{X}$ with regard to multicollinearity (see, for example, Belsley, Kuh,
and Welsch, 1980, pages 183-185).


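EXAMPLE 8.10.1. Let us consider the one-way classification model

$$y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \qquad i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n_i, \tag{8.111}$$

where $\mu$ and $\alpha_i$ are unknown parameters with the latter representing the
effect of the $i$th level of a certain factor at $m$ levels; $n_i$ observations are
obtained at the $i$th level. The $\epsilon_{ij}$'s are random errors assumed to be
independent and normally distributed with zero means and a common
variance $\sigma^2$.

Model (8.111) can be represented in vector form as model (8.96). Here,
$\mathbf{y} = (y_{11}, y_{12}, \ldots, y_{1n_1}, y_{21}, y_{22}, \ldots, y_{2n_2}, \ldots, y_{m1}, y_{m2}, \ldots, y_{mn_m})'$,
$\boldsymbol{\beta} = (\mu, \alpha_1, \alpha_2, \ldots, \alpha_m)'$, and $\mathbf{X}$ is of order $n \times (m+1)$ of the form $\mathbf{X} = [\mathbf{1}_n : \mathbf{T}]$,
where $\mathbf{1}_n$ is a vector of ones of order $n \times 1$, $n = \sum_{i=1}^{m} n_i$, and
$\mathbf{T} = \mathrm{Diag}(\mathbf{1}_{n_1}, \mathbf{1}_{n_2}, \ldots, \mathbf{1}_{n_m})$. The rank of $\mathbf{X}$ is $r = m$. For such a model, the
hypothesis of interest is

$$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_m,$$

which can be expressed as $H_0: \mathbf{L}\boldsymbol{\beta} = \mathbf{0}$, where $\mathbf{L}$ is a matrix of order
$(m-1) \times (m+1)$ and rank $m - 1$ of the form

$$\mathbf{L} = \begin{bmatrix}
0 & 1 & -1 & 0 & \cdots & 0 \\
0 & 1 & 0 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
0 & 1 & 0 & 0 & \cdots & -1
\end{bmatrix}.$$

This hypothesis states that the factor under consideration has no effect on
the response. Note that each row of $\mathbf{L}$ is a linear combination of the rows of
$\mathbf{X}$. For example, the $i$th row of $\mathbf{L}$, whose elements are equal to zero except
for the second and the $(i+2)$th elements, which are equal to 1 and $-1$,
respectively, is the difference between rows 1 and $\nu_i + 1$ of $\mathbf{X}$, where
$\nu_i = \sum_{j=1}^{i} n_j$, $i = 1, 2, \ldots, m-1$. Thus the rows of $\mathbf{L}$ form a basis for a
$q$-dimensional subspace $\mathcal{L}$ of $\rho(\mathbf{X})$, the row space of $\mathbf{X}$, where $q = m - 1$.

Let $\mu_i = \mu + \alpha_i$. Then $\mu_i$ is the mean of the $i$th level of the factor
($i = 1, 2, \ldots, m$). Consider the contrast $\psi = \sum_{i=1}^{m} c_i\mu_i$, that is, $\sum_{i=1}^{m} c_i = 0$. We
can write $\psi = \mathbf{a}'\boldsymbol{\beta}$, where $\mathbf{a}' = (0, c_1, c_2, \ldots, c_m)$ belongs to a $q$-dimensional
subspace of $R^{m+1}$. This subspace is the same as $\mathcal{L}$, since each row of $\mathbf{L}$ is of
the form $(0, c_1, c_2, \ldots, c_m)$ with $\sum_{i=1}^{m} c_i = 0$. Vice versa, if $\psi = \mathbf{a}'\boldsymbol{\beta}$ is such that
$\mathbf{a}' = (0, c_1, c_2, \ldots, c_m)$ with $\sum_{i=1}^{m} c_i = 0$, then $\mathbf{a}'$ can be expressed as

$$\mathbf{a}' = (-c_2, -c_3, \ldots, -c_m)\mathbf{L},$$

since $c_1 = -\sum_{i=2}^{m} c_i$. Hence, $\mathbf{a}' \in \mathcal{L}$. It follows that $\mathcal{L}$ is a subspace associated
with all contrasts among the means $\mu_1, \mu_2, \ldots, \mu_m$ of the $m$ levels of the
factor.

Simultaneous $(1 - \alpha)100\%$ confidence intervals on all contrasts of the
form $\psi = \sum_{i=1}^{m} c_i\mu_i$ can be obtained by applying formula (8.97). Here,
$q = m - 1$, $r = m$, and a generalized inverse of $\mathbf{X}'\mathbf{X}$ is of the form

$$(\mathbf{X}'\mathbf{X})^{-} = \begin{bmatrix} 0 & \mathbf{0}' \\ \mathbf{0} & \mathbf{D} \end{bmatrix},$$

where $\mathbf{D} = \mathrm{Diag}(n_1^{-1}, n_2^{-1}, \ldots, n_m^{-1})$ and $\mathbf{0}$ is a zero vector of order $m \times 1$.
Hence,

$$\mathbf{a}'\hat{\boldsymbol{\beta}} = (0, c_1, c_2, \ldots, c_m)(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} = \sum_{i=1}^{m} c_i \bar{y}_{i.},$$

where $\bar{y}_{i.} = (1/n_i)\sum_{j=1}^{n_i} y_{ij}$, $i = 1, 2, \ldots, m$. Furthermore,

$$\mathbf{a}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{a} = \sum_{i=1}^{m} \frac{c_i^2}{n_i}.$$

By making the substitution in formula (8.97) we obtain

$$\sum_{i=1}^{m} c_i \bar{y}_{i.} \mp \left[(m - 1)\, MS_E\, F_{\alpha, m-1, n-m} \sum_{i=1}^{m} \frac{c_i^2}{n_i}\right]^{1/2}. \tag{8.112}$$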
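
As an illustration of interval (8.112), the following Python sketch computes the Scheffé limits for a single contrast among the level means. It is a hedged example rather than part of the original text: the function name is hypothetical, the level means, sample sizes, and error mean square are made-up numbers, and the critical value `f_crit` must be supplied by the user (here 4.26, an approximate tabulated upper 5% point of the F-distribution with 2 and 9 degrees of freedom).

```python
import math


def scheffe_contrast_interval(c, ybar, n_i, ms_e, f_crit):
    """Scheffe limits (8.112) for the contrast sum(c_i * mu_i).

    c      : contrast coefficients (must sum to zero)
    ybar   : level means
    n_i    : number of observations at each level
    ms_e   : error mean square MS_E
    f_crit : upper alpha point of F with m - 1 and n - m degrees of freedom
             (looked up by the user; not computed here)
    """
    if abs(sum(c)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    m = len(c)
    estimate = sum(ci * yi for ci, yi in zip(c, ybar))
    half_width = math.sqrt(
        (m - 1) * ms_e * f_crit * sum(ci * ci / ni for ci, ni in zip(c, n_i))
    )
    return estimate - half_width, estimate + half_width


# Hypothetical balanced layout: m = 3 levels, 4 observations per level.
lower, upper = scheffe_contrast_interval(
    c=[1.0, 0.0, -1.0],
    ybar=[10.0, 12.0, 15.0],
    n_i=[4, 4, 4],
    ms_e=2.0,
    f_crit=4.26,
)
print(lower, upper)
```

Because the same multiplier $(m - 1) F_{\alpha, m-1, n-m}$ is used for every contrast, the function can be called repeatedly with different coefficient vectors without changing the joint confidence level.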


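Now, if the $F$-test rejects $H_0$ at the $\alpha$-level, then there exists a contrast
$\sum_{i=1}^{m} c_i^* \bar{y}_{i.} = \mathbf{a}^{*\prime}\hat{\boldsymbol{\beta}}$, which is significantly different from zero, that is, the
interval (8.112) for $c_i = c_i^*$ ($i = 1, 2, \ldots, m$) does not contain the value zero. Here,
$\mathbf{a}^{*\prime} = \mathbf{l}^{*\prime}\mathbf{L}$, where $\mathbf{l}^* = \mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}$ is an eigenvector of $\mathbf{G}_2^{-1}\mathbf{G}_1$ corresponding to
$e_{\max}(\mathbf{G}_2^{-1}\mathbf{G}_1)$. We have that

$$\mathbf{G}_2^{-1} = [\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}']^{-1} = \left(\frac{1}{n_1}\mathbf{J}_{m-1} + \boldsymbol{\Lambda}\right)^{-1},$$

where $\mathbf{J}_{m-1}$ is a matrix of ones of order $(m-1) \times (m-1)$, and
$\boldsymbol{\Lambda} = \mathrm{Diag}(n_2^{-1}, n_3^{-1}, \ldots, n_m^{-1})$. By applying the Sherman-Morrison-Woodbury
formula (see Exercise 2.15), we obtain

$$\begin{aligned}
\mathbf{G}_2^{-1} &= n_1(n_1\boldsymbol{\Lambda} + \mathbf{J}_{m-1})^{-1} \\
&= \boldsymbol{\Lambda}^{-1} - \frac{\boldsymbol{\Lambda}^{-1}\mathbf{J}_{m-1}\boldsymbol{\Lambda}^{-1}}{n_1 + \mathbf{1}_{m-1}'\boldsymbol{\Lambda}^{-1}\mathbf{1}_{m-1}} \\
&= \mathrm{Diag}(n_2, n_3, \ldots, n_m) - \frac{1}{n}\begin{bmatrix} n_2 \\ n_3 \\ \vdots \\ n_m \end{bmatrix}[n_2, n_3, \ldots, n_m].
\end{aligned}$$

Also,

$$\hat{\boldsymbol{\gamma}} = \mathbf{L}\hat{\boldsymbol{\beta}} = \mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} = \begin{bmatrix} \bar{y}_{1.} - \bar{y}_{2.} \\ \bar{y}_{1.} - \bar{y}_{3.} \\ \vdots \\ \bar{y}_{1.} - \bar{y}_{m.} \end{bmatrix}.$$

It can be verified that the $i$th element of $\mathbf{l}^* = \mathbf{G}_2^{-1}\mathbf{L}\hat{\boldsymbol{\beta}}$ is given by

$$l_i^* = n_{i+1}(\bar{y}_{1.} - \bar{y}_{i+1.}) - \frac{n_{i+1}}{n}\sum_{j=2}^{m} n_j(\bar{y}_{1.} - \bar{y}_{j.}), \qquad i = 1, 2, \ldots, m-1. \tag{8.113}$$

The estimated standard error, $\hat{\kappa}_i$, of the $i$th element of $\hat{\boldsymbol{\gamma}}$ ($i = 1, 2, \ldots, m-1$)
is the square root of the $i$th diagonal element of
$MS_E\, \mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{L}' = [(1/n_1)\mathbf{J}_{m-1} + \boldsymbol{\Lambda}]\, MS_E$, that is,

$$\hat{\kappa}_i = \left[\left(\frac{1}{n_1} + \frac{1}{n_{i+1}}\right) MS_E\right]^{1/2}, \qquad i = 1, 2, \ldots, m-1.$$

Thus by formula (8.110), large values of $|l_i^*|\hat{\kappa}_i$ identify those elements of
$\hat{\boldsymbol{\gamma}} = \mathbf{L}\hat{\boldsymbol{\beta}}$ that are influential contributors to the significance of the $F$-test. In
particular, if the data set used to analyze model (8.111) is balanced, that is,
$n_i = n/m$ for $i = 1, 2, \ldots, m$, then

$$|l_i^*|\hat{\kappa}_i = |\bar{y}_{i+1.} - \bar{y}_{..}| \left(\frac{2n}{m}\, MS_E\right)^{1/2}, \qquad i = 1, 2, \ldots, m-1,$$

where $\bar{y}_{..} = (1/m)\sum_{i=1}^{m} \bar{y}_{i.}$.

Alternatively, the contrast $\mathbf{a}^{*\prime}\hat{\boldsymbol{\beta}}$ can be expressed as

$$\begin{aligned}
\mathbf{a}^{*\prime}\hat{\boldsymbol{\beta}} &= \mathbf{l}^{*\prime}\mathbf{L}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{y} \\
&= \left(\sum_{i=1}^{m-1} l_i^*, -l_1^*, -l_2^*, \ldots, -l_{m-1}^*\right)(\bar{y}_{1.}, \bar{y}_{2.}, \ldots, \bar{y}_{m.})' \\
&= \sum_{i=1}^{m} c_i^* \bar{y}_{i.},
\end{aligned}$$

where

$$c_1^* = \sum_{i=1}^{m-1} l_i^*, \qquad c_i^* = -l_{i-1}^*, \quad i = 2, 3, \ldots, m. \tag{8.114}$$

Since the estimated standard error of $\bar{y}_{i.}$ is $(MS_E/n_i)^{1/2}$, $i = 1, 2, \ldots, m$, by
dividing $\bar{y}_{i.}$ by this value we obtain

$$\mathbf{a}^{*\prime}\hat{\boldsymbol{\beta}} = \sum_{i=1}^{m} \left(\frac{MS_E}{n_i}\right)^{1/2} c_i^* w_i,$$

where $w_i = \bar{y}_{i.}/(MS_E/n_i)^{1/2}$ is a scaled value of $\bar{y}_{i.}$ ($i = 1, 2, \ldots, m$). Hence,
large values of $(MS_E/n_i)^{1/2}|c_i^*|$ identify those $\bar{y}_{i.}$'s that contribute significantly
to the rejection of $H_0$. In particular, for a balanced data set,

$$l_i^* = \frac{n}{m}(\bar{y}_{..} - \bar{y}_{i+1.}), \qquad i = 1, 2, \ldots, m-1.$$

Thus from formula (8.114) we get

$$\left(\frac{MS_E}{n_i}\right)^{1/2}|c_i^*| = \left(\frac{n}{m}\, MS_E\right)^{1/2}|\bar{y}_{i.} - \bar{y}_{..}|, \qquad i = 1, 2, \ldots, m.$$

We conclude that large values of $|\bar{y}_{i.} - \bar{y}_{..}|$ are responsible for the rejection of
$H_0$ by the $F$-test. This is consistent with the fact that the numerator sum of
squares of the $F$-test statistic for $H_0$ is proportional to $\sum_{i=1}^{m} (\bar{y}_{i.} - \bar{y}_{..})^2$ when
the data set is balanced.


EXERCISES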


8.1. Consider the function

$$f(x_1, x_2) = 8x_1^2 - 4x_1x_2 + 5x_2^2.$$


Minimize f(x,, x,) using the method of steepest descent with x, = G, 2)’ 
as an initial point. 


8.2. Conduct a simulated steepest ascent exercise as follows: Use the 
function 


(X41, X2) =47.9 + 3x, — x, + 4x? + 4x, x, + 3x3 


as the true mean response, which depends on two input variables x, 
and x,. Generate response values by using the model 


y(x) = n(x) +, 
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8.4. 
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where e has the normal distribution with mean 0 and variance 2.25, 
and x =(x,,x,)’. Fit a first-order model in x, and x, in a neighbor- 
hood of the origin using a 2? factorial design along with the corre- 
sponding simulated response values. Make sure that replications are 
taken at the origin in order to test for lack of fit of the fitted model. 
Determine the path of steepest ascent, then proceed along it using 
simulated response values. Conduct additional experiments as de- 
scribed in Section 8.3.1. 


8.3. Two types of fertilizers were applied to experimental plots to assess
their effects on the yield of a certain variety of potato. The design
settings used in the experiment along with the corresponding yield
values are given in the following table:

    Original Settings              Coded Settings          Yield y
    Fertilizer 1   Fertilizer 2    x_1         x_2         (lb/plot)
        50.0           15.0        -1          -1           24.30
       120.0           15.0         1          -1           35.82
        50.0           25.0        -1           1           40.50
       120.0           25.0         1           1           50.94
        35.5           20.0        -2^{1/2}     0           30.60
       134.5           20.0         2^{1/2}     0           42.90
        85.0           12.9         0          -2^{1/2}     22.50
        85.0           27.1         0           2^{1/2}     50.40
        85.0           20.0         0           0           45.69


(a) Fit a second-order model in the coded variables

$$x_1 = \frac{F_1 - 85}{35}, \qquad x_2 = \frac{F_2 - 20}{5}$$

to the yield data, where $F_1$ and $F_2$ are the original settings of
fertilizers 1 and 2, respectively, used in the experiment.

(b) Apply the method of ridge analysis to determine the settings of the
two fertilizers that are needed to maximize the predicted yield (in
the space of the coded input variables, the region R is the interior
and boundary of a circle centered at the origin with a radius equal
to $2^{1/2}$).


8.4. Suppose that $\lambda_1$ and $\lambda_2$ are two values of the Lagrange multiplier $\lambda$
used in the method of ridge analysis. Let $\hat{y}_1$ and $\hat{y}_2$ be the corresponding
values of $\hat{y}$ on the two spheres $x'x = r_1^2$ and $x'x = r_2^2$, respectively.
Show that if $r_1 = r_2$ and $\lambda_1 > \lambda_2$, then $\hat{y}_1 > \hat{y}_2$.
8.5. Consider again Exercise 8.4. Let $x_1$ and $x_2$ be the stationary points
corresponding to $\lambda_1$ and $\lambda_2$, respectively. Consider also the matrix

$$M(x_i) = 2(\hat{B} - \lambda_i I), \qquad i = 1, 2.$$

Show that if $r_1 = r_2$, $M(x_1)$ is positive definite, and $M(x_2)$ is indefinite,
then $\hat{y}_1 < \hat{y}_2$.


8.6. Consider once more the method of ridge analysis. Let x be a stationary
point that corresponds to the radius r.


(a) Show that 


where $x_i$ is the ith element of x ($i = 1, 2, \ldots, k$).
(b) Make use of part (a) to show that 


8.7. Suppose that the “true” mean response $\eta(x)$ is represented by a model
of order $d_2$ in k input variables $x_1, x_2, \ldots, x_k$ of the form

$$\eta(x) = f'(x)\beta + g'(x)\delta,$$

where $x = (x_1, x_2, \ldots, x_k)'$. The fitted model is of order $d_1$ ($< d_2$) of the
form

$$\hat{y}(x) = f'(x)\hat{\gamma},$$

where $\hat{\gamma}$ is an estimator of $\beta$, not necessarily obtained by the method
of least squares. Let $\gamma = E(\hat{\gamma})$.

(a) Give an expression for B, the average squared bias of $\hat{y}(x)$, in terms
of $\beta$, $\delta$, and $\gamma$.

(b) Show that B achieves its minimum value if and only if $\gamma$ is of the
form $\gamma = C\tau$, where $\tau = (\beta', \delta')'$ and $C = [I : \Gamma_{11}^{-1}\Gamma_{12}]$. The matrices
$\Gamma_{11}$ and $\Gamma_{12}$ are the region moments used in formula (8.54).

(c) Deduce from part (b) that B achieves its minimum value if and
only if $C\tau$ is an estimable linear function (see Searle, 1971, Section
5.4).

(d) Use part (c) to show that B achieves its minimum value if and only
if there exists a matrix L such that $C = L[X : Z]$, where X and Z are
matrices consisting of the values taken by $f'(x)$ and $g'(x)$, respectively,
at n experimental runs.

(e) Deduce from part (d) that B achieves its minimum for any design
for which the row space of $[X : Z]$ contains the rows of C.

(f) Show that if $\hat{\gamma}$ is the least-squares estimator of $\beta$, that is,
$\hat{\gamma} = (X'X)^{-1}X'y$, where y is the vector of response values at the n
experimental runs, then the design property stated in part (e) holds
for any design that satisfies the conditions described in equations
(8.56).

[Note: This problem is based on an article by Karson, Manson, and
Hader (1969), who introduced the so-called minimum bias estimation
to minimize the average squared bias B.]


8.8. Consider again Exercise 8.7. Suppose that $f'(x)\beta = \beta_0 + \sum_{i=1}^{3} \beta_i x_i$ is a
first-order model in three input variables fitted to a data set obtained
by using the design


TE 7E E 

D E E TE , 
E TE E 
TE E E 


where g is a scale factor. The region of interest is a sphere of radius 1. 
Suppose that the “true” model is of the form 


$$\eta(x) = \beta_0 + \sum_{i=1}^{3} \beta_i x_i + \beta_{12} x_1 x_2 + \beta_{13} x_1 x_3 + \beta_{23} x_2 x_3.$$


(a) Can g be chosen so that D satisfies the conditions described in 
equations (8.56)? 

(b) Can g be chosen so that D satisfies the minimum bias property 
described in part (e) of Exercise 8.7? 


8.9. Consider the function

$$h(\delta, D) = \delta'\Delta\delta,$$

where $\delta$ is a vector of unknown parameters as in model (8.48), $\Delta$ is the
matrix in formula (8.53), namely $\Delta = A'\Gamma_{11}A - \Gamma_{12}'A - A'\Gamma_{12} + \Gamma_{22}$, and
D is the design matrix.

(a) Show that for a given D, the maximum of $h(\delta, D)$ over the region
$\psi = \{\delta \mid \delta'\delta \le r^2\}$ is equal to $r^2 e_{\max}(\Delta)$, where $e_{\max}(\Delta)$ is the largest
eigenvalue of $\Delta$.

(b) Deduce from part (a) a design criterion for choosing D.


8.10. Consider fitting the model

$$y(x) = f'(x)\beta + \epsilon,$$

where $\epsilon$ is a random error with a zero mean and a variance $\sigma^2$.
Suppose that the “true” mean response is given by

$$\eta(x) = f'(x)\beta + g'(x)\delta.$$

Let X and Z be the same matrices defined in part (d) of Exercise 8.7.
Consider the function $\lambda(\delta, D) = \delta'S\delta$, where

$$S = Z'[I - X(X'X)^{-1}X']Z,$$

and D is the design matrix. The quantity $\lambda(\delta, D)/\sigma^2$ is the noncentrality
parameter associated with the lack of fit F-test for the fitted model
(see Khuri and Cornell, 1996, Section 2.6). Large values of $\lambda/\sigma^2$
increase the power of the lack of fit test. By formula (8.54), the
minimum value of B is given by

$$B_{\min} = \frac{n}{\sigma^2}\,\delta'T\delta,$$

where $T = \Gamma_{22} - \Gamma_{12}'\Gamma_{11}^{-1}\Gamma_{12}$. The fitted model is considered to be
inadequate if there exists some constant $\kappa > 0$ such that $\delta'T\delta \ge \kappa$.
Show that

$$\inf_{\delta \in \Phi} \delta'S\delta = \kappa\, e_{\min}(T^{-1}S),$$

where $e_{\min}(T^{-1}S)$ is the smallest eigenvalue of $T^{-1}S$ and $\Phi$ is the
region $\{\delta \mid \delta'T\delta \ge \kappa\}$.
[Note: On the basis of this problem, we can define a new design
criterion, that which maximizes $e_{\min}(T^{-1}S)$ with respect to D. A design
chosen according to this criterion is called $\Lambda_1$-optimal (see Jones and
Mitchell, 1978).]


8.11. A second-order model of the form

$$y(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \epsilon$$

is fitted using a rotatable central composite design D, which consists of
a factorial $2^2$ portion, an axial portion with an axial parameter $\alpha = 2^{1/2}$,
and $n_0$ center-point replications. The settings of the $2^2$ factorial
portion are $\pm 1$. The region of interest R consists of the interior and
boundary of a circle of radius $2^{1/2}$ centered at the origin.

(a) Express V, the average variance of the predicted response given by
formula (8.52), as a function of $n_0$.
(b) Can $n_0$ be chosen so that it minimizes V?


8.12. Suppose that we have r response functions represented by the models

$$y_i = X\beta_i + \epsilon_i, \qquad i = 1, 2, \ldots, r,$$

where X is a known matrix of order $n \times p$ and rank p. The random
error vectors have the same variance-covariance structure as in Section
8.7. Let $F = E(Y) = XB$, where $Y = [y_1 : y_2 : \cdots : y_r]$ and $B = [\beta_1 : \beta_2 : \cdots : \beta_r]$.
Show that the determinant of $(Y - F)'(Y - F)$ attains a minimum
value when $B = \hat{B}$, where $\hat{B}$ is obtained by replacing each $\beta_i$ in B with
$\hat{\beta}_i = (X'X)^{-1}X'y_i$ ($i = 1, 2, \ldots, r$).
[Note: The minimization of the determinant of $(Y - F)'(Y - F)$ with
respect to B represents a general multiresponse estimation criterion
known as the Box-Draper determinant criterion (see Box and Draper,
1965).]


8.13. Let A be a $p \times p$ matrix with nonnegative eigenvalues. Show that

$$\det(A) \le \exp[\operatorname{tr}(A - I_p)].$$

[Note: This inequality is proved in an article by Watson (1964). It is
based on the simple inequality $a \le \exp(a - 1)$, which can be easily
proved for any real number a.]


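8.14. Let $x_1, x_2, \ldots, x_n$ be a sample of n independently distributed random
vectors from a p-variate normal distribution $N(\mu, V)$. The corresponding
likelihood function is

$$L = \frac{1}{(2\pi)^{np/2}[\det(V)]^{n/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^{n} (x_i - \mu)'V^{-1}(x_i - \mu)\right].$$

It is known that the maximum likelihood estimate of $\mu$ is $\bar{x}$, where
$\bar{x} = (1/n)\sum_{i=1}^{n} x_i$ (see, for example, Seber, 1984, pages 59-61). Let S be
the matrix

$$S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'.$$

Show that S is the maximum likelihood estimate of V by proving that

$$\frac{1}{[\det(V)]^{n/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})'V^{-1}(x_i - \bar{x})\right] \le \frac{1}{[\det(S)]^{n/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})'S^{-1}(x_i - \bar{x})\right],$$

or equivalently,

$$[\det(SV^{-1})]^{n/2} \exp\left[-\frac{n}{2}\operatorname{tr}(SV^{-1})\right] \le \exp\left[-\frac{n}{2}\operatorname{tr}(I_p)\right].$$

[Hint: Use the inequality given in Exercise 8.13.]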


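8.15. Consider the random one-way classification model

$$y_{ij} = \mu + \alpha_i + \epsilon_{ij}, \qquad i = 1, 2, \ldots, a; \quad j = 1, 2, \ldots, n_i,$$

where the $\alpha_i$'s and $\epsilon_{ij}$'s are independently distributed as $N(0, \sigma_\alpha^2)$ and
$N(0, \sigma_\epsilon^2)$. Determine the matrix S and the vector q in equation (8.95)
that can be used to obtain the MINQUEs of $\sigma_\alpha^2$ and $\sigma_\epsilon^2$.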


8.16. Consider the linear model

$$y = X\beta + \epsilon,$$

where X is a known matrix of order $n \times p$ and rank p, and $\epsilon$ is
normally distributed with a zero mean vector and a variance-covariance
matrix $\sigma^2 I_n$. Let $\hat{y}(x)$ denote the predicted response at a point x
in a region of interest R.

Use Scheffé’s confidence intervals given by formula (8.97) to obtain
simultaneous confidence intervals on the mean response values at the
points $x_1, x_2, \ldots, x_m$ ($m \le p$) in R. What is the joint confidence coefficient
for these intervals?


8.17. Consider the fixed-effects two-way classification model

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk},$$

$$i = 1, 2, \ldots, a; \quad j = 1, 2, \ldots, b; \quad k = 1, 2, \ldots, m,$$

where $\alpha_i$ and $\beta_j$ are unknown parameters, $(\alpha\beta)_{ij}$ is the interaction
effect, and $\epsilon_{ijk}$ is a random error that has the normal distribution with
a zero mean and a variance $\sigma^2$.

(a) Use Scheffé’s confidence intervals to obtain simultaneous confidence
intervals on all contrasts among the $\mu_i$'s, where $\mu_i = E(\bar{y}_{i..})$
and $\bar{y}_{i..} = (1/bm)\sum_{j=1}^{b}\sum_{k=1}^{m} y_{ijk}$.

(b) Identify those $\bar{y}_{i..}$'s that are influential contributors to the
significance of the F-test concerning the hypothesis $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$.


CHAPTER 9 


Approximation of Functions 


The class of polynomials is undoubtedly the simplest class of functions. In 
this chapter we shall discuss how to use polynomials to approximate continu- 
ous functions. Piecewise polynomial functions (splines) will also be discussed. 
Attention will be primarily confined to real-valued functions of a single 
variable x. 


9.1. WEIERSTRASS APPROXIMATION 


We may recall from Section 4.3 that if a function $f(x)$ has derivatives of all
orders in some neighborhood of the origin and the remainder term of its
Taylor expansion tends to zero there, then it can be represented by a
power series of the form $\sum_{n=0}^{\infty} a_n x^n$. If $\rho$ is the radius of convergence of this
series, then the series converges uniformly for $|x| \le r$, where $r < \rho$ (see
Theorem 5.4.4). It follows that for a given $\epsilon > 0$ we can take sufficiently many
terms of this power series and obtain a polynomial $p_n(x) = \sum_{k=0}^{n} a_k x^k$ of
degree n for which $|f(x) - p_n(x)| < \epsilon$ for $|x| \le r$. But a function that is not
differentiable of all orders does not have a power series representation.
However, if the function is continuous on the closed interval $[a, b]$, then it
can be approximated uniformly by a polynomial. This is guaranteed by the
following theorem:


Theorem 9.1.1 (Weierstrass Approximation Theorem). Let $f: [a, b] \to R$
be a continuous function. Then, for any $\epsilon > 0$, there exists a polynomial $p(x)$
such that

$$|f(x) - p(x)| < \epsilon \quad \text{for all } x \in [a, b].$$


Proof. Without loss of generality we can consider $[a, b]$ to be the interval
$[0, 1]$. This can always be achieved by making a change of variable of the form

$$t = \frac{x - a}{b - a}.$$

As x varies from a to b, t varies from 0 to 1. Thus, if necessary, we consider
that such a linear transformation has been made and that t has been
renamed as x.

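For each n, let $b_n(x)$ be defined as a polynomial of degree n of the form

$$b_n(x) = \sum_{k=0}^{n} \binom{n}{k} x^k (1 - x)^{n-k} f\left(\frac{k}{n}\right), \tag{9.1}$$

where

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}.$$

We have that

$$\sum_{k=0}^{n} \binom{n}{k} x^k (1 - x)^{n-k} = 1, \tag{9.2}$$

$$\sum_{k=0}^{n} \frac{k}{n} \binom{n}{k} x^k (1 - x)^{n-k} = x, \tag{9.3}$$

$$\sum_{k=0}^{n} \frac{k^2}{n^2} \binom{n}{k} x^k (1 - x)^{n-k} = \left(1 - \frac{1}{n}\right)x^2 + \frac{x}{n}. \tag{9.4}$$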


These identities can be shown as follows: Let $Y_n$ be a binomial random
variable $B(n, x)$. Thus $Y_n$ represents the number of successes in a sequence
of n independent Bernoulli trials with x the probability of success on a single
trial. Hence, $E(Y_n) = nx$ and $\mathrm{Var}(Y_n) = nx(1 - x)$ (see, for example, Harris,
1966, page 104; see also Exercise 5.30). It follows that

$$\sum_{k=0}^{n} \binom{n}{k} x^k (1 - x)^{n-k} = P(0 \le Y_n \le n) = 1. \tag{9.5}$$

Furthermore,

$$\sum_{k=0}^{n} k \binom{n}{k} x^k (1 - x)^{n-k} = E(Y_n) = nx, \tag{9.6}$$

$$\sum_{k=0}^{n} k^2 \binom{n}{k} x^k (1 - x)^{n-k} = E(Y_n^2) = \mathrm{Var}(Y_n) + [E(Y_n)]^2 = nx(1 - x) + n^2 x^2 = n^2\left[\left(1 - \frac{1}{n}\right)x^2 + \frac{x}{n}\right]. \tag{9.7}$$

Identities (9.2)-(9.4) follow directly from (9.5)-(9.7).
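Identities (9.2)-(9.4) are easy to check numerically. The following is a minimal sketch, not part of the original text; the function name `binomial_weight_sums` and the values of n and x are chosen here purely for illustration.

```python
from math import comb


def binomial_weight_sums(n, x):
    """Return the left-hand sides of identities (9.2), (9.3), and (9.4)."""
    # Binomial weights C(n, k) x^k (1 - x)^(n - k) for k = 0, ..., n.
    weights = [comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1)]
    s0 = sum(weights)                                          # (9.2)
    s1 = sum((k / n) * w for k, w in enumerate(weights))       # (9.3)
    s2 = sum((k / n) ** 2 * w for k, w in enumerate(weights))  # (9.4)
    return s0, s1, s2


n, x = 12, 0.3  # illustrative values, not taken from the text
s0, s1, s2 = binomial_weight_sums(n, x)
print("sums:", s0, s1, s2)
print("right-hand side of (9.4):", (1 - 1 / n) * x**2 + x / n)
```

The three sums should agree with 1, x, and (1 - 1/n)x^2 + x/n up to floating-point rounding.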
Let us now consider the difference $f(x) - b_n(x)$, which with the help of
identity (9.2) can be written as

$$f(x) - b_n(x) = \sum_{k=0}^{n} \left[f(x) - f\left(\frac{k}{n}\right)\right] \binom{n}{k} x^k (1 - x)^{n-k}. \tag{9.8}$$

Since $f(x)$ is continuous on $[0, 1]$, then it must be bounded and uniformly
continuous there (see Theorems 3.4.5 and 3.4.6). Hence, for the given $\epsilon > 0$,
there exist numbers $\delta$ and m such that

$$|f(x_1) - f(x_2)| \le \frac{\epsilon}{2} \quad \text{if } |x_1 - x_2| < \delta$$

and

$$|f(x)| < m \quad \text{for all } x \in [0, 1].$$

From formula (9.8) we then have

$$|f(x) - b_n(x)| \le \sum_{k=0}^{n} \left|f(x) - f\left(\frac{k}{n}\right)\right| \binom{n}{k} x^k (1 - x)^{n-k}.$$


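If $|x - k/n| < \delta$, then $|f(x) - f(k/n)| \le \epsilon/2$; otherwise, we have
$|f(x) - f(k/n)| < 2m$ for $0 \le x \le 1$. Consequently, by using identities (9.2)-(9.4) we
obtain

$$\begin{aligned}
|f(x) - b_n(x)| &\le \sum_{|x - k/n| < \delta} \frac{\epsilon}{2} \binom{n}{k} x^k (1 - x)^{n-k} + \sum_{|x - k/n| \ge \delta} 2m \binom{n}{k} x^k (1 - x)^{n-k} \\
&\le \frac{\epsilon}{2} + 2m \sum_{|x - k/n| \ge \delta} \frac{(k/n - x)^2}{(k/n - x)^2} \binom{n}{k} x^k (1 - x)^{n-k} \\
&\le \frac{\epsilon}{2} + \frac{2m}{\delta^2} \sum_{|x - k/n| \ge \delta} \left(\frac{k}{n} - x\right)^2 \binom{n}{k} x^k (1 - x)^{n-k} \\
&\le \frac{\epsilon}{2} + \frac{2m}{\delta^2} \sum_{k=0}^{n} \left(\frac{k^2}{n^2} - \frac{2kx}{n} + x^2\right) \binom{n}{k} x^k (1 - x)^{n-k}.
\end{aligned}$$

Hence,

$$|f(x) - b_n(x)| \le \frac{\epsilon}{2} + \frac{2m}{\delta^2}\,\frac{x(1 - x)}{n} \le \frac{\epsilon}{2} + \frac{m}{2n\delta^2}, \tag{9.9}$$

since $x(1 - x) \le \frac{1}{4}$ for $0 \le x \le 1$. By choosing n large enough that

$$\frac{m}{2n\delta^2} < \frac{\epsilon}{2},$$

we conclude that

$$|f(x) - b_n(x)| < \epsilon$$

for all $x \in [0, 1]$. The proof of the theorem follows by taking $p(x) = b_n(x)$. □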


Definition 9.1.1. Let $f(x)$ be defined on $[0, 1]$. The polynomial $b_n(x)$
defined by formula (9.1) is called the Bernstein polynomial of degree n for
$f(x)$. □
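Formula (9.1) translates directly into code. The sketch below is an illustration only: the helper names `bernstein` and `max_error`, the test function |x - 1/2|, and the grid size are assumptions made here, not part of the text. It evaluates $b_n(x)$ and reports the largest error on a grid for a few values of n.

```python
from math import comb


def bernstein(f, n, x):
    """Evaluate the Bernstein polynomial b_n(x) of formula (9.1) for f on [0, 1]."""
    return sum(
        f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1)
    )


def max_error(f, n, grid_size=201):
    """Largest value of |f(x) - b_n(x)| over an equally spaced grid on [0, 1]."""
    points = [i / (grid_size - 1) for i in range(grid_size)]
    return max(abs(f(x) - bernstein(f, n, x)) for x in points)


def kink(x):
    # A continuous but non-differentiable test function (illustrative choice).
    return abs(x - 0.5)


for n in (10, 40, 160):
    print(n, max_error(kink, n))
```

The printed errors shrink as n grows, in line with the uniform convergence established in the proof of Theorem 9.1.1. For much larger n the individual terms can underflow in floating point, so a log-scale or recursive evaluation would be needed.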


By the proof of Theorem 9.1.1 we conclude that the sequence $\{b_n(x)\}_{n=1}^{\infty}$
of Bernstein polynomials converges uniformly to $f(x)$ on $[0, 1]$. These polynomials
are useful in that they not only prove the existence of an approximating
polynomial for $f(x)$, but also provide a simple explicit representation for it.
Another advantage of Bernstein polynomials is that if $f(x)$ is continuously
differentiable on $[0, 1]$, then the derivative of $b_n(x)$ also converges uniformly
to $f'(x)$. A more general statement is given by the next theorem, whose proof
can be found in Davis (1975, Theorem 6.3.2, page 113).


Theorem 9.1.2. Let $f(x)$ be p times differentiable on $[0, 1]$. If the pth
derivative is continuous there, then

$$\lim_{n \to \infty} \frac{d^p b_n(x)}{dx^p} = \frac{d^p f(x)}{dx^p}$$

uniformly on $[0, 1]$.


Obviously, the knowledge that the sequence $\{b_n(x)\}_{n=1}^{\infty}$ converges uniformly
to $f(x)$ on $[0, 1]$ is not complete without knowing something about the
rate of convergence. For this purpose we need to define the so-called
modulus of continuity of $f(x)$ on $[a, b]$.

Definition 9.1.2. If $f(x)$ is continuous on $[a, b]$, then, for any $\delta > 0$, the
modulus of continuity of $f(x)$ on $[a, b]$ is

$$\omega(\delta) = \sup_{|x_1 - x_2| \le \delta} |f(x_1) - f(x_2)|,$$

where $x_1$ and $x_2$ are points in $[a, b]$. □
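The modulus of continuity can be estimated numerically by restricting the supremum to grid points. The sketch below is an illustration under that assumption (the function name `modulus_of_continuity` and the grid size are chosen here); because only grid pairs are examined, it returns a lower estimate of $\omega(\delta)$ in general.

```python
from math import sqrt


def modulus_of_continuity(f, a, b, delta, grid_size=401):
    """Grid estimate of sup |f(x1) - f(x2)| over x1, x2 in [a, b], |x1 - x2| <= delta."""
    points = [a + (b - a) * i / (grid_size - 1) for i in range(grid_size)]
    values = [f(x) for x in points]
    best = 0.0
    for i in range(grid_size):
        for j in range(i, grid_size):
            if points[j] - points[i] > delta:
                break  # points are sorted, so later j are even farther away
            best = max(best, abs(values[j] - values[i]))
    return best


# For f(x) = sqrt(x) on [0, 1] the supremum is attained at x1 = 0, x2 = delta.
for delta in (0.01, 0.04, 0.25):
    print(delta, modulus_of_continuity(sqrt, 0.0, 1.0, delta), sqrt(delta))
```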


On the basis of Definition 9.1.2 we have the following properties concerning
the modulus of continuity:

Lemma 9.1.1. If $0 < \delta_1 \le \delta_2$, then $\omega(\delta_1) \le \omega(\delta_2)$.

Lemma 9.1.2. For a function $f(x)$ to be uniformly continuous on $[a, b]$ it
is necessary and sufficient that $\lim_{\delta \to 0} \omega(\delta) = 0$.

The proofs of Lemmas 9.1.1 and 9.1.2 are left to the reader.

Lemma 9.1.3. For any $\lambda > 0$, $\omega(\lambda\delta) \le (\lambda + 1)\omega(\delta)$.


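Proof. Suppose that $\lambda > 0$ is given. We can find an integer n such that
$n \le \lambda < n + 1$. By Lemma 9.1.1, $\omega(\lambda\delta) \le \omega[(n + 1)\delta]$. Let $x_1$ and $x_2$ be two
points in $[a, b]$ such that $x_1 < x_2$ and $|x_1 - x_2| \le (n + 1)\delta$. Let us also divide
the interval $[x_1, x_2]$ into $n + 1$ equal parts, each of length $(x_2 - x_1)/(n + 1)$,
by means of the partition points

$$y_i = x_1 + i\,\frac{x_2 - x_1}{n + 1}, \qquad i = 0, 1, \ldots, n + 1.$$

Then

$$\begin{aligned}
|f(x_1) - f(x_2)| &= |f(x_2) - f(x_1)| \\
&= \left|\sum_{i=0}^{n} [f(y_{i+1}) - f(y_i)]\right| \\
&\le \sum_{i=0}^{n} |f(y_{i+1}) - f(y_i)| \\
&\le (n + 1)\,\omega(\delta),
\end{aligned}$$

since $|y_{i+1} - y_i| = [1/(n + 1)]\,|x_2 - x_1| \le \delta$ for $i = 0, 1, \ldots, n$. It follows that

$$\omega[(n + 1)\delta] = \sup_{|x_1 - x_2| \le (n+1)\delta} |f(x_1) - f(x_2)| \le (n + 1)\,\omega(\delta).$$

Consequently,

$$\omega(\lambda\delta) \le \omega[(n + 1)\delta] \le (n + 1)\,\omega(\delta) \le (\lambda + 1)\,\omega(\delta). \qquad □$$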


Theorem 9.1.3. Let $f(x)$ be continuous on $[0, 1]$, and let $b_n(x)$ be the
Bernstein polynomial defined by formula (9.1). Then

$$|f(x) - b_n(x)| \le \frac{3}{2}\,\omega\left(\frac{1}{\sqrt{n}}\right)$$

for all $x \in [0, 1]$, where $\omega(\delta)$ is the modulus of continuity of $f(x)$ on $[0, 1]$.


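Proof. Using formula (9.1) and identity (9.2), we have that

$$\begin{aligned}
|f(x) - b_n(x)| &= \left|\sum_{k=0}^{n} \left[f(x) - f\left(\frac{k}{n}\right)\right] \binom{n}{k} x^k (1 - x)^{n-k}\right| \\
&\le \sum_{k=0}^{n} \left|f(x) - f\left(\frac{k}{n}\right)\right| \binom{n}{k} x^k (1 - x)^{n-k} \\
&\le \sum_{k=0}^{n} \omega\left(\left|x - \frac{k}{n}\right|\right) \binom{n}{k} x^k (1 - x)^{n-k}.
\end{aligned}$$

Now, by applying Lemma 9.1.3 we can write

$$\omega\left(\left|x - \frac{k}{n}\right|\right) = \omega\left(n^{1/2}\left|x - \frac{k}{n}\right| n^{-1/2}\right) \le \left(1 + n^{1/2}\left|x - \frac{k}{n}\right|\right)\omega(n^{-1/2}).$$

Thus

$$\begin{aligned}
|f(x) - b_n(x)| &\le \sum_{k=0}^{n} \left(1 + n^{1/2}\left|x - \frac{k}{n}\right|\right)\omega(n^{-1/2}) \binom{n}{k} x^k (1 - x)^{n-k} \\
&\le \omega(n^{-1/2})\left[1 + n^{1/2} \sum_{k=0}^{n} \left|x - \frac{k}{n}\right| \binom{n}{k} x^k (1 - x)^{n-k}\right].
\end{aligned}$$

But, by the Cauchy-Schwarz inequality (see part 1 of Theorem 2.1.2), we
have

$$\begin{aligned}
\sum_{k=0}^{n} \left|x - \frac{k}{n}\right| \binom{n}{k} x^k (1 - x)^{n-k}
&= \sum_{k=0}^{n} \left|x - \frac{k}{n}\right| \left[\binom{n}{k} x^k (1 - x)^{n-k}\right]^{1/2} \left[\binom{n}{k} x^k (1 - x)^{n-k}\right]^{1/2} \\
&\le \left[\sum_{k=0}^{n} \left(x - \frac{k}{n}\right)^2 \binom{n}{k} x^k (1 - x)^{n-k}\right]^{1/2} \left[\sum_{k=0}^{n} \binom{n}{k} x^k (1 - x)^{n-k}\right]^{1/2} \\
&= \left[\sum_{k=0}^{n} \left(x - \frac{k}{n}\right)^2 \binom{n}{k} x^k (1 - x)^{n-k}\right]^{1/2}, \quad \text{by identity (9.2)} \\
&= \left[x^2 - 2x^2 + \left(1 - \frac{1}{n}\right)x^2 + \frac{x}{n}\right]^{1/2}, \quad \text{by identities (9.3) and (9.4)} \\
&= \left[\frac{x(1 - x)}{n}\right]^{1/2} \\
&\le \left(\frac{1}{4n}\right)^{1/2}, \quad \text{since } x(1 - x) \le \tfrac{1}{4}.
\end{aligned}$$

It follows that

$$|f(x) - b_n(x)| \le \omega(n^{-1/2})\left[1 + n^{1/2}\left(\frac{1}{4n}\right)^{1/2}\right],$$

that is,

$$|f(x) - b_n(x)| \le \frac{3}{2}\,\omega(n^{-1/2})$$

for all $x \in [0, 1]$. □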


We note that Theorem 9.1.3 can be used to prove Theorem 9.1.1 as
follows: If $f(x)$ is continuous on $[0, 1]$, then $f(x)$ is uniformly continuous on
$[0, 1]$. Hence, by Lemma 9.1.2, $\omega(n^{-1/2}) \to 0$ as $n \to \infty$.


Corollary 9.1.1. If $f(x)$ is a Lipschitz continuous function $\mathrm{Lip}(K, \alpha)$ on
$[0, 1]$, then

$$|f(x) - b_n(x)| \le \frac{3}{2} K n^{-\alpha/2} \tag{9.10}$$

for all $x \in [0, 1]$.

Proof. By Definition 3.4.6,

$$|f(x_1) - f(x_2)| \le K|x_1 - x_2|^{\alpha}$$

for all $x_1, x_2$ in $[0, 1]$. Thus

$$\omega(\delta) \le K\delta^{\alpha}.$$

By Theorem 9.1.3 we then have

$$|f(x) - b_n(x)| \le \frac{3}{2} K n^{-\alpha/2}$$

for all $x \in [0, 1]$. □


Theorem 9.1.4 (Voronovsky’s Theorem). If $f(x)$ is bounded on $[0, 1]$ and
has a second-order derivative at a point $x_0$ in $[0, 1]$, then

$$\lim_{n \to \infty} n[b_n(x_0) - f(x_0)] = \frac{1}{2} x_0 (1 - x_0) f''(x_0).$$

Proof. See Davis (1975, Theorem 6.3.6, page 117). □


We note from Corollary 9.1.1 and Voronovsky’s theorem that the convergence
of Bernstein polynomials can be very slow. For example, if $f(x)$
satisfies the conditions of Voronovsky’s theorem, then at every point $x \in [0, 1]$
where $f''(x) \ne 0$, $b_n(x)$ converges to $f(x)$ just like $c/n$, where c is a constant.
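The c/n behavior can be observed numerically. In the sketch below, which is only an illustration, the choice f(x) = exp(x), the point x0 = 0.5, and the helper name `bernstein` are assumptions made here; the scaled error n[b_n(x0) - f(x0)] is compared with the limit (1/2) x0 (1 - x0) f''(x0) from Voronovsky's theorem.

```python
from math import comb, exp


def bernstein(f, n, x):
    """Evaluate the Bernstein polynomial b_n(x) of formula (9.1)."""
    return sum(
        f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1)
    )


x0 = 0.5
# For f = exp we have f'' = exp, so the limit is (1/2) x0 (1 - x0) exp(x0).
limit = 0.5 * x0 * (1 - x0) * exp(x0)
for n in (25, 100, 400):
    scaled_error = n * (bernstein(exp, n, x0) - exp(x0))
    print(n, scaled_error, limit)
```

The scaled errors settle near the limit, so the error itself decays only at the rate 1/n.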


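EXAMPLE 9.1.1. We recall from Section 3.4.2 that $f(x) = \sqrt{x}$ is $\mathrm{Lip}(1, \frac{1}{2})$
for $x \ge 0$. Then, by Corollary 9.1.1,

$$|\sqrt{x} - b_n(x)| \le \frac{3}{2n^{1/4}}$$

for $0 \le x \le 1$, where

$$b_n(x) = \sum_{k=0}^{n} \binom{n}{k} x^k (1 - x)^{n-k} \left(\frac{k}{n}\right)^{1/2}.$$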


9.2. APPROXIMATION BY POLYNOMIAL INTERPOLATION 


One possible method to approximate a function f(x) with a polynomial p(x) 
is to select such a polynomial so that both f(x) and p(x) have the same 
values at a certain number of points in the domain of f(x). This procedure is 
called interpolation. The rationale behind it is that if f(x) agrees with p(x) 
at some known points, then the two functions should be close to one another 
at intermediate points. 

Let us first consider the following result given by the next theorem. 
Theorem 9.2.1. Let $a_0, a_1, \ldots, a_n$ be $n + 1$ distinct points in R, the set of
real numbers. Let $b_0, b_1, \ldots, b_n$ be any given set of $n + 1$ real numbers. Then,
there exists a unique polynomial $p(x)$ of degree $\le n$ such that $p(a_i) = b_i$,
$i = 0, 1, \ldots, n$.


Proof. Since $p(x)$ is a polynomial of degree $\le n$, it can be represented as
$p(x) = \sum_{j=0}^{n} c_j x^j$. We must then have

$$\sum_{j=0}^{n} c_j a_i^j = b_i, \qquad i = 0, 1, \ldots, n.$$

These equations can be written in vector form as

$$\begin{bmatrix}
1 & a_0 & a_0^2 & \cdots & a_0^n \\
1 & a_1 & a_1^2 & \cdots & a_1^n \\
\vdots & \vdots & \vdots & & \vdots \\
1 & a_n & a_n^2 & \cdots & a_n^n
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_n \end{bmatrix}
=
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_n \end{bmatrix}. \tag{9.11}$$

The determinant of the $(n + 1) \times (n + 1)$ matrix on the left side of equation
(9.11) is known as Vandermonde’s determinant and is equal to $\prod_{i > j} (a_i - a_j)$.
The proof of this last assertion can be found in, for example, Graybill (1983,
Theorem 8.12.2, page 266). Since the $a_i$’s are distinct, this determinant is
different from zero. It follows that this matrix is nonsingular. Hence, equation
(9.11) provides a unique solution for $c_0, c_1, \ldots, c_n$. □


Corollary 9.2.1. The polynomial $p(x)$ in Theorem 9.2.1 can be represented
as

$$p(x) = \sum_{i=0}^{n} b_i l_i(x), \tag{9.12}$$

where

$$l_i(x) = \prod_{\substack{j=0 \\ j \ne i}}^{n} \frac{x - a_j}{a_i - a_j}, \qquad i = 0, 1, \ldots, n. \tag{9.13}$$

Proof. We have that $l_i(x)$ is a polynomial of degree n ($i = 0, 1, \ldots, n$).
Furthermore, $l_i(a_j) = 0$ if $i \ne j$, and $l_i(a_i) = 1$ ($i = 0, 1, \ldots, n$). It follows that
$\sum_{i=0}^{n} b_i l_i(x)$ is a polynomial of degree $\le n$ and assumes the values
$b_0, b_1, \ldots, b_n$ at $a_0, a_1, \ldots, a_n$, respectively. This polynomial is unique by
Theorem 9.2.1. □


Definition 9.2.1. The polynomial defined by formula (9.12) is called a
Lagrange interpolating polynomial. The points $a_0, a_1, \ldots, a_n$ are called points
of interpolation (or nodes), and $l_i(x)$ in formula (9.13) is called the ith
Lagrange polynomial associated with the $a_i$’s. □

The values $b_0, b_1, \ldots, b_n$ in formula (9.12) are frequently the values of
some function $f(x)$ at the points $a_0, a_1, \ldots, a_n$. Thus $f(x)$ and the polynomial
$p(x)$ in formula (9.12) attain the same values at these points. The polynomial
$p(x)$, which can be written as

$$p(x) = \sum_{i=0}^{n} f(a_i) l_i(x), \tag{9.14}$$

provides therefore an approximation for $f(x)$ over $[a_0, a_n]$.
EXAMPLE 9.2.1. Consider the function $f(x) = x^{1/2}$. Let $a_0 = 60$, $a_1 = 70$,
$a_2 = 85$, $a_3 = 105$ be interpolation points. Then

$$p(x) = 7.7460\,l_0(x) + 8.3666\,l_1(x) + 9.2195\,l_2(x) + 10.2470\,l_3(x),$$

where

$$l_0(x) = \frac{(x - 70)(x - 85)(x - 105)}{(60 - 70)(60 - 85)(60 - 105)},$$

$$l_1(x) = \frac{(x - 60)(x - 85)(x - 105)}{(70 - 60)(70 - 85)(70 - 105)},$$

$$l_2(x) = \frac{(x - 60)(x - 70)(x - 105)}{(85 - 60)(85 - 70)(85 - 105)},$$

$$l_3(x) = \frac{(x - 60)(x - 70)(x - 85)}{(105 - 60)(105 - 70)(105 - 85)}.$$

Using $p(x)$ as an approximation of $f(x)$ over the interval $[60, 105]$, tabulated
values of $f(x)$ and $p(x)$ were obtained at several points inside this interval.
The results are given in Table 9.1.

Table 9.1. Approximation of $f(x) = x^{1/2}$ by the Lagrange Interpolating
Polynomial $p(x)$

      x        f(x)        p(x)
     60      7.74597     7.74597
     64      8.00000     7.99978
     68      8.24621     8.24611
     70      8.36660     8.36660
     74      8.60233     8.60251
     78      8.83176     8.83201
     82      9.05539     9.05555
     85      9.21954     9.21954
     90      9.48683     9.48646
     94      9.69536     9.69472
     98      9.89949     9.89875
    102     10.09950    10.09899
    105     10.24695    10.24695
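The computation in Example 9.2.1 can be reproduced with a few lines of code. The following is a minimal sketch of formulas (9.13) and (9.14); the function names `lagrange_basis` and `lagrange_interpolate` are chosen here for illustration, and the node values are computed with full floating-point precision rather than the four-decimal values quoted above.

```python
from math import sqrt


def lagrange_basis(nodes, i, x):
    """Evaluate the ith Lagrange polynomial l_i(x) of formula (9.13)."""
    value = 1.0
    for j, a_j in enumerate(nodes):
        if j != i:
            value *= (x - a_j) / (nodes[i] - a_j)
    return value


def lagrange_interpolate(nodes, values, x):
    """Evaluate p(x) = sum of values[i] * l_i(x), as in formula (9.12)."""
    return sum(b_i * lagrange_basis(nodes, i, x) for i, b_i in enumerate(values))


nodes = [60.0, 70.0, 85.0, 105.0]
values = [sqrt(a) for a in nodes]
for x in (64.0, 74.0, 90.0, 102.0):
    p = lagrange_interpolate(nodes, values, x)
    print(x, round(sqrt(x), 5), round(p, 5))
```

The printed pairs can be compared with the corresponding rows of Table 9.1.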


9.2.1. The Accuracy of Lagrange Interpolation 


Let us now address the question of evaluating the accuracy of Lagrange 
interpolation. The answer to this question is given in the next theorem. 


Theorem 9.2.2. Suppose that $f(x)$ has n continuous derivatives on the
interval $[a, b]$, and its $(n + 1)$st derivative exists on $(a, b)$. Let
$a = a_0 < a_1 < \cdots < a_n = b$ be $n + 1$ points in $[a, b]$. If $p(x)$ is the Lagrange interpolating
polynomial defined by formula (9.14), then for any $x \in [a, b]$, $x \ne a_i$
($i = 0, 1, \ldots, n$), there exists a point $c \in (a, b)$, depending on x, such that

$$f(x) - p(x) = \frac{1}{(n + 1)!} f^{(n+1)}(c)\, g_{n+1}(x), \tag{9.15}$$

where

$$g_{n+1}(x) = \prod_{i=0}^{n} (x - a_i).$$

Proof. Define the function $h(t)$ as

$$h(t) = f(t) - p(t) - [f(x) - p(x)]\,\frac{g_{n+1}(t)}{g_{n+1}(x)}.$$

If $t = x$, then $h(x) = 0$. For $t = a_i$ ($i = 0, 1, \ldots, n$),

$$h(a_i) = f(a_i) - p(a_i) - [f(x) - p(x)]\,\frac{g_{n+1}(a_i)}{g_{n+1}(x)} = 0.$$

The function $h(t)$ has n continuous derivatives on $[a, b]$, and its $(n + 1)$st
derivative exists on $(a, b)$. Furthermore, $h(t)$ vanishes at x and at all $n + 1$
interpolation points, that is, it has at least $n + 2$ different zeros in $[a, b]$. By
Rolle’s theorem (Theorem 4.2.1), $h'(t)$ vanishes at least once between any
two zeros of $h(t)$ and thus has at least $n + 1$ different zeros in $(a, b)$. Also by
Rolle’s theorem, $h''(t)$ has at least n different zeros in $(a, b)$. By continuing
this argument, we see that $h^{(n+1)}(t)$ has at least one zero in $(a, b)$, say at the
point c. But,

$$\begin{aligned}
h^{(n+1)}(t) &= f^{(n+1)}(t) - p^{(n+1)}(t) - \frac{f(x) - p(x)}{g_{n+1}(x)}\, g_{n+1}^{(n+1)}(t) \\
&= f^{(n+1)}(t) - \frac{f(x) - p(x)}{g_{n+1}(x)}\,(n + 1)!,
\end{aligned}$$

since $p(t)$ is a polynomial of degree $\le n$ and $g_{n+1}(t)$ is a polynomial of the
form $t^{n+1} + \lambda_1 t^n + \lambda_2 t^{n-1} + \cdots + \lambda_{n+1}$ for suitable constants $\lambda_1, \lambda_2, \ldots, \lambda_{n+1}$.
We thus have

$$f^{(n+1)}(c) - \frac{f(x) - p(x)}{g_{n+1}(x)}\,(n + 1)! = 0,$$

from which we can conclude formula (9.15). □


Corollary 9.2.2. Suppose that $f^{(n+1)}(x)$ is continuous on $[a, b]$. Let
$\tau_{n+1} = \sup_{a \le x \le b} |f^{(n+1)}(x)|$, $\kappa_{n+1} = \sup_{a \le x \le b} |g_{n+1}(x)|$. Then

$$\sup_{a \le x \le b} |f(x) - p(x)| \le \frac{\tau_{n+1}\kappa_{n+1}}{(n + 1)!}.$$

Proof. This follows directly from formula (9.15) and the fact that
$f(x) - p(x) = 0$ for $x = a_0, a_1, \ldots, a_n$. □

We note that $\kappa_{n+1}$, being the supremum of $|g_{n+1}(x)| = |\prod_{i=0}^{n} (x - a_i)|$
over $[a, b]$, is a function of the location of the $a_i$’s. From Corollary 9.2.2 we
can then write

$$\sup_{a \le x \le b} |f(x) - p(x)| \le \frac{\phi(a_0, a_1, \ldots, a_n)}{(n + 1)!} \sup_{a \le x \le b} |f^{(n+1)}(x)|, \tag{9.16}$$

where $\phi(a_0, a_1, \ldots, a_n) = \sup_{a \le x \le b} |g_{n+1}(x)|$. This inequality provides us with
an upper bound on the error of approximating $f(x)$ with $p(x)$ over the
interpolation region. We refer to this error as interpolation error. The upper
bound clearly shows that the interpolation error depends on the location of
the interpolation points.
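As a rough numerical illustration of the bound in Corollary 9.2.2, the sketch below applies it to the setting of Example 9.2.1 ($f(x) = x^{1/2}$ with nodes 60, 70, 85, 105). The grid, the helper names, and the use of a grid maximum in place of the exact supremum of $|g_4(x)|$ are assumptions made here; the fourth derivative $-\frac{15}{16}x^{-7/2}$ is largest in absolute value at x = 60.

```python
from math import factorial, sqrt

nodes = [60.0, 70.0, 85.0, 105.0]


def node_poly(x):
    """g_{n+1}(x) = product of (x - a_i) over all interpolation points."""
    value = 1.0
    for a_i in nodes:
        value *= x - a_i
    return value


def lagrange(x):
    """Lagrange interpolating polynomial (9.14) for f(x) = sqrt(x) at the nodes."""
    total = 0.0
    for i, a_i in enumerate(nodes):
        term = sqrt(a_i)
        for j, a_j in enumerate(nodes):
            if j != i:
                term *= (x - a_j) / (a_i - a_j)
        total += term
    return total


grid = [60.0 + 45.0 * k / 900 for k in range(901)]
kappa = max(abs(node_poly(x)) for x in grid)   # grid estimate of sup |g_4(x)|
tau = (15.0 / 16.0) * 60.0 ** (-3.5)            # sup |f''''(x)| on [60, 105]
bound = tau * kappa / factorial(4)
observed = max(abs(sqrt(x) - lagrange(x)) for x in grid)
print("observed max error:", observed)
print("upper bound       :", bound)
```

The observed error on the grid stays below the bound, which is somewhat conservative because it replaces $|f^{(4)}(c)|$ by its largest value on the interval.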


Corollary 9.2.3. If, in Corollary 9.2.2, $n = 2$, and if $a_1 - a_0 = a_2 - a_1 = \delta$,
then

$$\sup_{a \le x \le b} |f(x) - p(x)| \le \frac{\sqrt{3}}{27}\,\tau_3\,\delta^3.$$

Proof. Consider $g_3(x) = (x - a_0)(x - a_1)(x - a_2)$, which can be written as
$g_3(x) = z(z^2 - \delta^2)$, where $z = x - a_1$. This function is symmetric with respect
to $x = a_1$. It is easy to see that $|g_3(x)|$ attains an absolute maximum over
$a_0 \le x \le a_2$, or equivalently, $-\delta \le z \le \delta$, when $z = \pm\delta/\sqrt{3}$. Hence,

$$\kappa_3 = \sup_{a_0 \le x \le a_2} |g_3(x)| = \max_{-\delta \le z \le \delta} |z(z^2 - \delta^2)| = \frac{2}{3\sqrt{3}}\,\delta^3.$$

By applying Corollary 9.2.2 we obtain

$$\sup_{a_0 \le x \le a_2} |f(x) - p(x)| \le \frac{\sqrt{3}}{27}\,\tau_3\,\delta^3. \qquad □$$


We have previously noted that the interpolation error depends on the choice
of the interpolation points. This leads us to the following important question:
How can the interpolation points be chosen so as to minimize the interpolation
error? The answer to this question lies in inequality (9.16). One reasonable
criterion for the choice of interpolation points is the minimization of
$\phi(a_0, a_1, \ldots, a_n)$ with respect to $a_0, a_1, \ldots, a_n$. It turns out that the optimal
locations of $a_0, a_1, \ldots, a_n$ are given by the zeros of the Chebyshev polynomial
(of the first kind) of degree $n + 1$ (see Section 10.4.1).


Definition 9.2.2. The Chebyshev polynomial of degree n is defined by

$$\begin{aligned}
T_n(x) &= \cos(n \operatorname{Arccos} x) \\
&= x^n + \binom{n}{2} x^{n-2}(x^2 - 1) + \binom{n}{4} x^{n-4}(x^2 - 1)^2 + \cdots, \qquad n = 0, 1, \ldots. 
\end{aligned} \tag{9.17}$$

Obviously, by the definition of $T_n(x)$, $-1 \le x \le 1$. One of the properties of
$T_n(x)$ is that it has simple zeros at the following n points:

$$\zeta_i = \cos\left[\frac{(2i - 1)\pi}{2n}\right], \qquad i = 1, 2, \ldots, n. \qquad □$$

The proof of this property is given in Davis (1975, pages 61-62).
We can consider Chebyshev polynomials defined on the interval $[a, b]$ by
making the transformation

$$x = \frac{a + b}{2} + \frac{b - a}{2}\,\zeta,$$

which transforms the interval $-1 \le \zeta \le 1$ into the interval $a \le x \le b$. In this
case, the zeros of the Chebyshev polynomial of degree n over the interval
$[a, b]$ are given by

$$z_i = \frac{a + b}{2} + \frac{b - a}{2} \cos\left[\frac{(2i - 1)\pi}{2n}\right], \qquad i = 1, 2, \ldots, n.$$

We refer to the $z_i$’s as Chebyshev points. These points can be obtained
geometrically by subdividing the semicircle over the interval $[a, b]$ into n
equal arcs and then projecting the midpoint of each arc onto the interval (see
De Boor, 1978, page 26).
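The formula for the Chebyshev points is simple to evaluate. The following sketch is illustrative only (the function names and the interval [60, 105] are chosen here): it computes the points $z_i$ and checks that, after mapping back to $[-1, 1]$, they are zeros of $T_n(x) = \cos(n \operatorname{Arccos} x)$.

```python
from math import acos, cos, pi


def chebyshev_points(a, b, n):
    """Zeros of the degree-n Chebyshev polynomial transformed to [a, b]."""
    return [
        (a + b) / 2 + (b - a) / 2 * cos((2 * i - 1) * pi / (2 * n))
        for i in range(1, n + 1)
    ]


def chebyshev_T(n, t):
    """T_n(t) = cos(n Arccos t) for -1 <= t <= 1."""
    return cos(n * acos(t))


a, b, n = 60.0, 105.0, 4
for z in chebyshev_points(a, b, n):
    t = (2 * z - a - b) / (b - a)  # inverse of the transformation to [a, b]
    print(round(z, 4), chebyshev_T(n, t))
```

The second column should be zero up to rounding error. Note that the points cluster toward the ends of the interval rather than being equally spaced.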

Chebyshev points have a very interesting property that pertains to the
minimization of $\phi(a_0, a_1, \ldots, a_n)$ in inequality (9.16). This property is described
in Theorem 9.2.3, whose proof can be found in Davis (1975, Section
3.3); see also De Boor (1978, page 30).

Theorem 9.2.3. The function

$$\phi(a_0, a_1, \ldots, a_n) = \sup_{a \le x \le b} \left|\prod_{i=0}^{n} (x - a_i)\right|,$$

where the $a_i$’s belong to the interval $[a, b]$, achieves its minimum at the zeros
of the Chebyshev polynomial of degree $n + 1$, that is, at

$$z_i = \frac{a + b}{2} + \frac{b - a}{2} \cos\left[\frac{(2i + 1)\pi}{2n + 2}\right], \qquad i = 0, 1, \ldots, n, \tag{9.18}$$

and

$$\min_{a_0, a_1, \ldots, a_n} \phi(a_0, a_1, \ldots, a_n) = \frac{2(b - a)^{n+1}}{4^{n+1}}.$$

From Theorem 9.2.3 and inequality (9.16) we conclude that the choice of
the Chebyshev points given in formula (9.18) is optimal in the sense of
reducing the interpolation error. In other words, among all sets of interpolation
points of size $n + 1$ each, Chebyshev points produce a Lagrange
polynomial approximation for $f(x)$ over the interval $[a, b]$ with a minimum
upper bound on the error of approximation. Using inequality (9.16), we
obtain the following interesting result:

$$\sup_{a \le x \le b} |f(x) - p(x)| \le \frac{2}{(n + 1)!} \left(\frac{b - a}{4}\right)^{n+1} \sup_{a \le x \le b} |f^{(n+1)}(x)|. \tag{9.19}$$


The use of Chebyshev points in the construction of the Lagrange interpolating
polynomial $p(x)$ for the function $f(x)$ over $[a, b]$ produces an approximation
which, for all practical purposes, differs very little from the best possible
approximation of $f(x)$ by a polynomial of the same degree. This was shown
by Powell (1967). More explicitly, let $p^*(x)$ be the best approximating
polynomial of $f(x)$ of the same degree as $p(x)$ over $[a, b]$. Then, obviously,

$$\sup_{a \le x \le b} |f(x) - p^*(x)| \le \sup_{a \le x \le b} |f(x) - p(x)|.$$

De Boor (1978, page 31) pointed out that for $n \le 20$,

$$\sup_{a \le x \le b} |f(x) - p(x)| \le 4 \sup_{a \le x \le b} |f(x) - p^*(x)|.$$

This indicates that the error of interpolation which results from the use of
Lagrange polynomials in combination with Chebyshev points does not exceed
the minimum approximation error by more than a factor of 4 for $n \le 20$. This
is a very useful result, since the derivation of the best approximating
polynomial can be tedious and complicated, whereas a polynomial approximation
obtained by Lagrange interpolation that uses Chebyshev points as
interpolation points is simple and straightforward.
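The effect of the node locations on the factor $\sup |\prod (x - a_i)|$ in inequality (9.16) can be seen in a small experiment. The sketch below is illustrative: the interval [0, 1], the choice n = 5, the grid size, and the function name `sup_node_poly` are assumptions made here. It compares Chebyshev points from formula (9.18) with equally spaced points and with the minimum value $2(b - a)^{n+1}/4^{n+1}$ stated in Theorem 9.2.3.

```python
from math import cos, pi


def sup_node_poly(nodes, a, b, grid_size=2001):
    """Grid estimate of the supremum over [a, b] of |prod (x - a_i)|."""
    best = 0.0
    for k in range(grid_size):
        x = a + (b - a) * k / (grid_size - 1)
        prod = 1.0
        for a_i in nodes:
            prod *= x - a_i
        best = max(best, abs(prod))
    return best


a, b, n = 0.0, 1.0, 5
# Chebyshev points from formula (9.18), i = 0, 1, ..., n.
cheb = [
    (a + b) / 2 + (b - a) / 2 * cos((2 * i + 1) * pi / (2 * n + 2))
    for i in range(n + 1)
]
# Equally spaced points for comparison.
equi = [a + (b - a) * i / n for i in range(n + 1)]

print("Chebyshev points :", sup_node_poly(cheb, a, b))
print("theoretical min  :", 2 * (b - a) ** (n + 1) / 4 ** (n + 1))
print("equally spaced   :", sup_node_poly(equi, a, b))
```

With Chebyshev points the computed supremum matches the theoretical minimum, while the equally spaced points give a noticeably larger value.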


9.2.2. A Combination of Interpolation and Approximation 


In Section 9.1 we learned how to approximate a continuous function
$f: [a, b] \to R$ with a polynomial by applying the Weierstrass theorem. In this
section we have seen how to interpolate values of f on $[a, b]$ by using
Lagrange polynomials. We now show that these two processes can be
combined. More specifically, suppose that we are given $n + 1$ distinct points
in $[a, b]$, which we denote by $a_0, a_1, \ldots, a_n$ with $a_0 = a$ and $a_n = b$. Let $\epsilon > 0$
be given. We need to find a polynomial $q(x)$ such that $|f(x) - q(x)| < \epsilon$ for
all x in $[a, b]$, and $f(a_i) = q(a_i)$, $i = 0, 1, \ldots, n$.

By Theorem 9.1.1 there exists a polynomial $p(x)$ such that

$$|f(x) - p(x)| < \epsilon' \quad \text{for all } x \in [a, b],$$

where $\epsilon' < \epsilon/(1 + M)$, and M is a nonnegative number to be described later.
Furthermore, by Theorem 9.2.1 there exists a unique polynomial $u(x)$ such
that

$$u(a_i) = f(a_i) - p(a_i), \qquad i = 0, 1, \ldots, n.$$

This polynomial is given by

$$u(x) = \sum_{i=0}^{n} [f(a_i) - p(a_i)]\, l_i(x), \tag{9.20}$$

where $l_i(x)$ is the ith Lagrange polynomial defined in formula (9.13). Using
formula (9.20) we obtain

$$\max_{a \le x \le b} |u(x)| \le \sum_{i=0}^{n} |f(a_i) - p(a_i)| \max_{a \le x \le b} |l_i(x)| \le \epsilon' M,$$

where $M = \sum_{i=0}^{n} \max_{a \le x \le b} |l_i(x)|$, which is some finite nonnegative number.
Note that M depends only on $[a, b]$ and $a_0, a_1, \ldots, a_n$. Now, define $q(x)$ as
$q(x) = p(x) + u(x)$. Then

$$q(a_i) = p(a_i) + u(a_i) = f(a_i), \qquad i = 0, 1, \ldots, n.$$

Furthermore,

$$|f(x) - q(x)| \le |f(x) - p(x)| + |u(x)| < \epsilon' + \epsilon' M < \epsilon \quad \text{for all } x \in [a, b].$$
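The construction $q(x) = p(x) + u(x)$ can be sketched in code. In the illustration below, everything beyond the formulas themselves is an assumption made here: f is taken to be exp on [0, 1], p is a Bernstein polynomial of degree 30 standing in for the Weierstrass approximant, the nodes are arbitrary, and the quantities $\epsilon'$ and M are replaced by their maxima over a finite grid.

```python
from math import comb, exp


def bernstein(f, n, x):
    """Bernstein polynomial b_n(x) of formula (9.1), used here as p(x)."""
    return sum(
        f(k / n) * comb(n, k) * x**k * (1 - x) ** (n - k) for k in range(n + 1)
    )


def lagrange_basis(nodes, i, x):
    """ith Lagrange polynomial l_i(x) of formula (9.13)."""
    value = 1.0
    for j, a_j in enumerate(nodes):
        if j != i:
            value *= (x - a_j) / (nodes[i] - a_j)
    return value


f = exp
nodes = [0.0, 0.25, 0.6, 1.0]  # a_0 = a = 0 and a_n = b = 1 (illustrative)
degree = 30


def p(x):
    return bernstein(f, degree, x)


def u(x):
    # Formula (9.20): interpolate the residuals f(a_i) - p(a_i).
    return sum(
        (f(a_i) - p(a_i)) * lagrange_basis(nodes, i, x)
        for i, a_i in enumerate(nodes)
    )


def q(x):
    return p(x) + u(x)


grid = sorted(set([k / 200 for k in range(201)] + nodes))
eps_prime = max(abs(f(x) - p(x)) for x in grid)  # grid stand-in for epsilon'
M = sum(
    max(abs(lagrange_basis(nodes, i, x)) for x in grid) for i in range(len(nodes))
)
print("max |f - q| on grid:", max(abs(f(x) - q(x)) for x in grid))
print("eps' (1 + M)       :", eps_prime * (1 + M))
print("q - f at the nodes :", [q(a_i) - f(a_i) for a_i in nodes])
```

On the grid, q reproduces f at the nodes (up to rounding) and its error stays within $\epsilon'(1 + M)$, mirroring the two properties derived above.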


9.3. APPROXIMATION BY SPLINE FUNCTIONS 


Approximation of a continuous function f(x) with a single polynomial p(x) 
may not be quite adequate in situations in which f(x) represents a real 
physical relationship. The behavior of such a function in one region may be 
unrelated to its behavior in another region. This type of behavior may not be 
satisfactorily matched by any polynomial. This is attributed to the fact that 
the behavior of a polynomial everywhere is governed by its behavior in any 
small region. In such situations, it would be more appropriate to partition the 
domain of f(x) into several intervals and then use a different approximating 
polynomial, usually of low degree, in each subinterval. These polynomial 
segments can be joined in a smooth way, which leads to what is called a 
piecewise polynomial function. 

By definition, a spline function is a piecewise polynomial of degree n. The
various polynomial segments (all of degree n) are joined together at points
called knots in such a way that the entire spline function is continuous and its
first n - 1 derivatives are also continuous. Spline functions were first
introduced by Schoenberg (1946).


9.3.1. Properties of Spline Functions 


Let [a, b] be an interval, and let $a = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = b$ be
partition points in [a, b]. A spline function s(x) of degree n with knots at the
points $\tau_1, \tau_2, \ldots, \tau_m$ has the following properties:


i. s(x) is a polynomial of degree not exceeding n on each subinterval
$[\tau_{i-1}, \tau_i]$, $1 \le i \le m + 1$.

ii. s(x) has continuous derivatives up to order n - 1 on [a, b].


In particular, if n = 1, then the spline function is called a linear spline and 
can be represented as 


$$s(x) = \sum_{i=1}^{m} a_i |x - \tau_i|,$$

where $a_1, a_2, \ldots, a_m$ are fixed numbers. We note that between any two knots,
$|x - \tau_i|$, i = 1, 2, ..., m, represents a straight-line segment. Thus the graph of
s(x) is made up of straight-line segments joined at the knots.
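We can obtain a linear spline that resembles Lagrange interpolation: Let
$\theta_0, \theta_1, \ldots, \theta_{m+1}$ be given real numbers. For $1 \le i \le m$, consider the functions

$$l_0(x) = \begin{cases} \dfrac{x - \tau_1}{\tau_0 - \tau_1}, & \tau_0 \le x \le \tau_1, \\ 0, & \tau_1 \le x \le \tau_{m+1}, \end{cases}$$

$$l_i(x) = \begin{cases} 0, & x \notin [\tau_{i-1}, \tau_{i+1}], \\ \dfrac{x - \tau_{i-1}}{\tau_i - \tau_{i-1}}, & \tau_{i-1} \le x \le \tau_i, \\ \dfrac{\tau_{i+1} - x}{\tau_{i+1} - \tau_i}, & \tau_i \le x \le \tau_{i+1}, \end{cases}$$

$$l_{m+1}(x) = \begin{cases} 0, & \tau_0 \le x \le \tau_m, \\ \dfrac{x - \tau_m}{\tau_{m+1} - \tau_m}, & \tau_m \le x \le \tau_{m+1}. \end{cases}$$

Then the linear spline

$$s(x) = \sum_{i=0}^{m+1} \theta_i\, l_i(x) \tag{9.21}$$

has the property that $s(\tau_i) = \theta_i$, $0 \le i \le m + 1$. It can be shown that the linear
spline having this property is unique.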
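The construction in formula (9.21) can be sketched in a few lines of Python. This is only an illustration: the function names `hat` and `linear_spline`, and the knot and ordinate values, are assumptions made for the example and do not come from the text.

```python
def hat(i, knots, x):
    """Value at x of the hat function l_i for knots tau_0 < ... < tau_{m+1}."""
    last = len(knots) - 1
    # Rising branch on [tau_{i-1}, tau_i] (absent for i = 0).
    if i > 0 and knots[i - 1] <= x <= knots[i]:
        return (x - knots[i - 1]) / (knots[i] - knots[i - 1])
    # Falling branch on [tau_i, tau_{i+1}] (absent for i = m + 1).
    if i < last and knots[i] <= x <= knots[i + 1]:
        return (knots[i + 1] - x) / (knots[i + 1] - knots[i])
    return 0.0


def linear_spline(knots, theta, x):
    """Evaluate s(x) = sum_i theta_i * l_i(x), as in formula (9.21)."""
    return sum(t * hat(i, knots, x) for i, t in enumerate(theta))


knots = [0.0, 1.0, 2.5, 4.0]   # tau_0, ..., tau_3 (illustrative values)
theta = [1.0, 3.0, -1.0, 2.0]  # prescribed ordinates at the knots
print([linear_spline(knots, theta, t) for t in knots])  # reproduces theta
print(linear_spline(knots, theta, 0.5))                 # midway between 1 and 3
```

Because each hat function equals one at its own knot and zero at every other knot, evaluating the spline at a knot simply returns the corresponding ordinate.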

Another special case is the cubic spline for n = 3. This is a widely used
spline function in many applications. It can be represented as
$s(x) = s_i(x) = a_i + b_i x + c_i x^2 + d_i x^3$, $\tau_{i-1} \le x \le \tau_i$, i = 1, 2, ..., m + 1, such that for
i = 1, 2, ..., m,

$$s_i(\tau_i) = s_{i+1}(\tau_i),$$
$$s_i'(\tau_i) = s_{i+1}'(\tau_i),$$
$$s_i''(\tau_i) = s_{i+1}''(\tau_i).$$

In general, a spline of degree n with knots at $\tau_1, \tau_2, \ldots, \tau_m$ is represented
as

$$s(x) = \sum_{i=1}^{m} e_i (x - \tau_i)_+^n + p(x), \tag{9.22}$$

where $e_1, e_2, \ldots, e_m$ are constants, p(x) is a polynomial of degree n, and

$$(x - \tau_i)_+^n = \begin{cases} (x - \tau_i)^n, & x \ge \tau_i, \\ 0, & x < \tau_i. \end{cases}$$

For an illustration, consider the cubic spline

$$s(x) = \begin{cases} a_1 + b_1 x + c_1 x^2 + d_1 x^3, & a \le x \le \tau, \\ a_2 + b_2 x + c_2 x^2 + d_2 x^3, & \tau \le x \le b. \end{cases}$$

Here, s(x) along with its first and second derivatives must be continuous at
$x = \tau$. Therefore, we must have

$$a_1 + b_1\tau + c_1\tau^2 + d_1\tau^3 = a_2 + b_2\tau + c_2\tau^2 + d_2\tau^3, \tag{9.23}$$
$$b_1 + 2c_1\tau + 3d_1\tau^2 = b_2 + 2c_2\tau + 3d_2\tau^2, \tag{9.24}$$
$$2c_1 + 6d_1\tau = 2c_2 + 6d_2\tau. \tag{9.25}$$

Equation (9.25) can be written as
$$c_1 - c_2 = 3(d_2 - d_1)\tau. \tag{9.26}$$
From equations (9.24) and (9.26) we get
$$b_1 - b_2 + 3(d_2 - d_1)\tau^2 = 0. \tag{9.27}$$

Using now equations (9.26) and (9.27) in equation (9.23), we obtain
$a_1 - a_2 + 3(d_1 - d_2)\tau^3 + 3(d_2 - d_1)\tau^3 + (d_1 - d_2)\tau^3 = 0$, or equivalently,

$$a_1 - a_2 + (d_1 - d_2)\tau^3 = 0. \tag{9.28}$$

We conclude that

$$d_2 - d_1 = \frac{1}{\tau^3}(a_1 - a_2) = \frac{1}{3\tau^2}(b_2 - b_1) = \frac{1}{3\tau}(c_1 - c_2). \tag{9.29}$$

Let us now express s(x) in the form given by equation (9.22), that is,
$$s(x) = e_1 (x - \tau)_+^3 + \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3.$$
In this case,

$$\alpha_0 = a_1, \quad \alpha_1 = b_1, \quad \alpha_2 = c_1, \quad \alpha_3 = d_1,$$

and
$$-e_1\tau^3 + \alpha_0 = a_2, \tag{9.30}$$
$$3e_1\tau^2 + \alpha_1 = b_2, \tag{9.31}$$
$$-3e_1\tau + \alpha_2 = c_2, \tag{9.32}$$
$$e_1 + \alpha_3 = d_2. \tag{9.33}$$

In light of equation (9.29), equations (9.30)-(9.33) have a common solution
for $e_1$ given by $e_1 = d_2 - d_1 = (1/3\tau)(c_1 - c_2) = (1/3\tau^2)(b_2 - b_1) = (1/\tau^3)(a_1 - a_2)$.
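The equivalence between the two-piece form and the truncated-power form (9.22) can be checked numerically. The sketch below is illustrative only: the helper names and the numerical values of $\tau$, $e_1$, and the $\alpha$'s are assumptions chosen for the example.

```python
def right_coefficients(tau, e1, alpha):
    """Solve (9.30)-(9.33) for (a2, b2, c2, d2), given e1 and alpha = (a1, b1, c1, d1)."""
    a1, b1, c1, d1 = alpha
    return (a1 - e1 * tau**3, b1 + 3 * e1 * tau**2, c1 - 3 * e1 * tau, d1 + e1)


def piecewise_cubic(x, tau, left, right):
    """Two-piece cubic: coefficients `left` for x <= tau, `right` for x > tau."""
    a, b, c, d = left if x <= tau else right
    return a + b * x + c * x**2 + d * x**3


def truncated_power_cubic(x, tau, e1, alpha):
    """s(x) = e1 * (x - tau)_+^3 + alpha_0 + alpha_1 x + alpha_2 x^2 + alpha_3 x^3."""
    a0, a1, a2, a3 = alpha
    return e1 * max(x - tau, 0.0) ** 3 + a0 + a1 * x + a2 * x**2 + a3 * x**3


tau, e1 = 1.5, 2.0                  # illustrative knot and jump coefficient
alpha = (1.0, -0.5, 0.25, 0.1)      # illustrative left-piece coefficients
right = right_coefficients(tau, e1, alpha)

for x in (0.0, 1.0, 1.5, 2.0, 3.0):
    print(x, piecewise_cubic(x, tau, alpha, right), truncated_power_cubic(x, tau, e1, alpha))
```

The two evaluations agree on both sides of the knot, and the four ratios in equation (9.29) all recover the same value of $e_1$.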


9.3.2. Error Bounds for Spline Approximation 


Let $a = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = b$ be a partition of [a, b]. We recall that
the linear spline s(x) given by formula (9.21) has the property that $s(\tau_i) = \theta_i$,
$0 \le i \le m + 1$, where $\theta_0, \theta_1, \ldots, \theta_{m+1}$ are any given real numbers. In particular,
if $\theta_i$ is the value at $\tau_i$ of a function f(x) defined on the interval [a, b],
then s(x) provides a spline approximation of f(x) over [a, b] which agrees
with f(x) at $\tau_0, \tau_1, \ldots, \tau_{m+1}$. If f(x) has continuous derivatives up to order 2
over [a, b], then an upper bound on the error of approximation is given by
(see De Boor, 1978, page 40)

$$\max_{a \le x \le b} |f(x) - s(x)| \le \frac{1}{8} \left( \max_i \Delta\tau_i \right)^2 \max_{a \le x \le b} |f''(x)|,$$

where $\Delta\tau_i = \tau_{i+1} - \tau_i$, i = 0, 1, ..., m. This error bound can be made small by
reducing the value of $\max_i \Delta\tau_i$.
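As a rough numerical illustration of this bound, the sketch below interpolates f(x) = sin x on $[0, \pi]$ with equally spaced knots and compares the observed maximum error on a fine grid with $(1/8)(\max_i \Delta\tau_i)^2 \max |f''(x)|$. The test function, the grid size, and the function names are assumptions made for the example; for sin x, $\max |f''(x)| = 1$.

```python
import math


def linear_interp(knots, values, x):
    """Piecewise linear interpolant through (knots[k], values[k])."""
    for k in range(len(knots) - 1):
        if knots[k] <= x <= knots[k + 1]:
            w = (x - knots[k]) / (knots[k + 1] - knots[k])
            return (1 - w) * values[k] + w * values[k + 1]
    raise ValueError("x lies outside [a, b]")


def max_error_and_bound(n_intervals, n_grid=2001):
    """Observed max |sin x - s(x)| on a grid, and the bound h^2 / 8 (since max |f''| = 1)."""
    a, b = 0.0, math.pi
    h = (b - a) / n_intervals
    knots = [a + k * h for k in range(n_intervals)] + [b]
    values = [math.sin(t) for t in knots]
    grid = [min(b, a + (b - a) * j / (n_grid - 1)) for j in range(n_grid)]
    err = max(abs(math.sin(x) - linear_interp(knots, values, x)) for x in grid)
    return err, h**2 / 8.0


for n in (10, 20):
    err, bound = max_error_and_bound(n)
    print(n, err, bound)
```

Halving the knot spacing reduces both the bound and the observed error by a factor of about four, in line with the quadratic dependence on $\max_i \Delta\tau_i$.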

A more efficient and smoother spline approximation than the one pro- 
vided by the linear spline is the commonly used cubic spline approximation. 
We recall that a cubic spline defined on [a, b] is a piecewise cubic polynomial 
that is twice continuously differentiable. Let f(x) be defined on [a, b]. There 
exists a unique cubic spline s(x) that satisfies the following interpolatory 
constraints: 


$$s(\tau_i) = f(\tau_i), \quad i = 0, 1, \ldots, m + 1,$$
$$s'(\tau_0) = f'(\tau_0),$$
$$s'(\tau_{m+1}) = f'(\tau_{m+1}) \tag{9.34}$$


(see Prenter, 1975, Section 4.2). 

If f(x) has continuous derivatives up to order 4 on [a, b], then information 
on the error of approximation, which results from using a cubic spline, can be 
obtained from the following theorem, whose proof is given in Hall (1968): 
Theorem 9.3.1. Let $a = \tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1} = b$ be a partition of
[a, b]. Let s(x) be a cubic spline associated with f(x) that satisfies the
constraints described in (9.34). If f(x) has continuous derivatives up to order
4 on [a, b], then

$$\max_{a \le x \le b} |f(x) - s(x)| \le \frac{5}{384} \left( \max_i \Delta\tau_i \right)^4 \max_{a \le x \le b} |f^{(4)}(x)|,$$

where $\Delta\tau_i = \tau_{i+1} - \tau_i$, i = 0, 1, ..., m.

Another advantage of cubic spline approximation is the fact that it can be
used to approximate the first-order and second-order derivatives of f(x).
Hall and Meyer (1976) proved that if f(x) satisfies the conditions of Theorem
9.3.1, then

$$\max_{a \le x \le b} |f'(x) - s'(x)| \le \frac{1}{24} \left( \max_i \Delta\tau_i \right)^3 \max_{a \le x \le b} |f^{(4)}(x)|,$$
$$\max_{a \le x \le b} |f''(x) - s''(x)| \le \frac{3}{8} \left( \max_i \Delta\tau_i \right)^2 \max_{a \le x \le b} |f^{(4)}(x)|.$$

Furthermore, the bounds concerning |f(x) - s(x)| and |f'(x) - s'(x)| are
best possible.
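These bounds are simple to evaluate, and the first one can be inverted to estimate how many equal subintervals are needed for a target accuracy. The sketch below assumes the constants 5/384, 1/24, and 3/8 reported by Hall and Meyer (1976); the function names and the numerical inputs are hypothetical.

```python
import math


def cubic_spline_bounds(max_step, max_f4):
    """Upper bounds on |f - s|, |f' - s'|, |f'' - s''| for the spline of (9.34)."""
    return {
        "f": 5.0 / 384.0 * max_step**4 * max_f4,
        "f1": 1.0 / 24.0 * max_step**3 * max_f4,
        "f2": 3.0 / 8.0 * max_step**2 * max_f4,
    }


def intervals_needed(length, max_f4, tol):
    """Smallest number of equal subintervals for which the bound on |f - s| is <= tol."""
    h = (384.0 * tol / (5.0 * max_f4)) ** 0.25   # largest admissible step
    return max(1, math.ceil(length / h))


# Illustrative case: f(x) = sin x on [0, pi], for which max |f''''(x)| = 1.
n = intervals_needed(math.pi, 1.0, 1e-6)
print(n, cubic_spline_bounds(math.pi / n, 1.0))
```

The fourth-power dependence on the step size is what makes the cubic spline far more economical in knots than the linear spline at the same accuracy.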


9.4. APPLICATIONS IN STATISTICS 


There is a wide variety of applications of polynomial approximation in 
statistics. In this section, we discuss the use of Lagrange interpolation in 
optimal design theory and the role of spline approximation in regression 
analysis. Other applications will be seen later in Chapter 10 (Section 10.9). 


9.4.1. Approximate Linearization of Nonlinear Models 
by Lagrange Interpolation 


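We recall from Section 8.6 that a nonlinear model is one of the form
$$y(\mathbf{x}) = h(\mathbf{x}, \boldsymbol{\theta}) + \epsilon, \tag{9.35}$$


where $\mathbf{x} = (x_1, x_2, \ldots, x_k)'$ is a vector of k input variables, $\boldsymbol{\theta} = (\theta_1, \theta_2, \ldots, \theta_p)'$
is a vector of p unknown parameters, $\epsilon$ is a random error, and $h(\mathbf{x}, \boldsymbol{\theta})$ is a
known function which is nonlinear in at least one element of $\boldsymbol{\theta}$.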

We also recall that the choice of design for model (9.35), on the basis of
the Box-Lucas criterion, depends on the values of the elements of $\boldsymbol{\theta}$ that
appear nonlinearly in the model. To overcome this undesirable design
dependence problem, one possible approach is to construct an approximation
to the mean response function $h(\mathbf{x}, \boldsymbol{\theta})$ with a Lagrange interpolating
polynomial. This approximation can then be utilized to obtain a design for
parameter estimation which does not depend on the parameter vector $\boldsymbol{\theta}$. We shall
restrict our consideration of model (9.35) to the case of a single input
variable x.

Let us suppose that the region of interest, R, is the interval [a, b], and that
$\boldsymbol{\theta}$ belongs to a parameter space $\Omega$. We assume that:


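a. $h(x, \boldsymbol{\theta})$ has continuous partial derivatives up to order r + 1 with respect
to x over [a, b] for all $\boldsymbol{\theta} \in \Omega$, where r is such that $r + 1 \ge p$ with p
being the number of parameters in model (9.35), and is large enough so
that

$$\frac{2}{(r+1)!} \left( \frac{b - a}{4} \right)^{r+1} \sup_{a \le x \le b} \left| \frac{\partial^{r+1} h(x, \boldsymbol{\theta})}{\partial x^{r+1}} \right| < \delta \tag{9.36}$$

for all $\boldsymbol{\theta} \in \Omega$, where $\delta$ is a small positive constant chosen appropriately
so that the Lagrange interpolation of $h(x, \boldsymbol{\theta})$ achieves a certain accuracy.

b. $h(x, \boldsymbol{\theta})$ has continuous first-order partial derivatives with respect to the
elements of $\boldsymbol{\theta}$.

c. For any set of distinct points, $x_0, x_1, \ldots, x_r$, such that $a \le x_0 < x_1 < \cdots < x_r \le b$,
where r is the integer defined in (a), the p × (r + 1) matrix

$$U(\boldsymbol{\theta}) = [\nabla h(x_0, \boldsymbol{\theta}) : \nabla h(x_1, \boldsymbol{\theta}) : \cdots : \nabla h(x_r, \boldsymbol{\theta})]$$

is of rank p, where $\nabla h(x_i, \boldsymbol{\theta})$ is the vector of partial derivatives of
$h(x_i, \boldsymbol{\theta})$ with respect to the elements of $\boldsymbol{\theta}$ (i = 0, 1, ..., r).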


Let us now consider the points $z_0, z_1, \ldots, z_r$, where $z_i$ is the ith Chebyshev
point defined by formula (9.18). Let $p_r(x, \boldsymbol{\theta})$ denote the corresponding
Lagrange interpolating polynomial for $h(x, \boldsymbol{\theta})$ over [a, b], which utilizes the
$z_i$'s as interpolation points. Then, by formula (9.14) we have

$$p_r(x, \boldsymbol{\theta}) = \sum_{i=0}^{r} h(z_i, \boldsymbol{\theta})\, l_i(x), \tag{9.37}$$

where $l_i(x)$ is a polynomial of degree r which can be obtained from formula
(9.13) by substituting $z_i$ for $a_i$ (i = 0, 1, ..., r). By inequality (9.19), an upper
bound on the error of approximating $h(x, \boldsymbol{\theta})$ with $p_r(x, \boldsymbol{\theta})$ is given by

$$\sup_{a \le x \le b} |h(x, \boldsymbol{\theta}) - p_r(x, \boldsymbol{\theta})| \le \frac{2}{(r+1)!} \left( \frac{b - a}{4} \right)^{r+1} \sup_{a \le x \le b} \left| \frac{\partial^{r+1} h(x, \boldsymbol{\theta})}{\partial x^{r+1}} \right|.$$

However, by inequality (9.36), this upper bound is less than $\delta$. We then have

$$\sup_{a \le x \le b} |h(x, \boldsymbol{\theta}) - p_r(x, \boldsymbol{\theta})| < \delta \tag{9.38}$$

for all $\boldsymbol{\theta} \in \Omega$. This provides the desired accuracy of approximation.
On the basis of the above arguments, an approximate representation of
model (9.35) is given by

$$y(x) = p_r(x, \boldsymbol{\theta}) + \epsilon. \tag{9.39}$$

Model (9.39) will now be utilized in place of $h(x, \boldsymbol{\theta})$ to construct an optimal
design for estimating $\boldsymbol{\theta}$.

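Let us now apply the Box-Lucas criterion described in Section 8.6 to the
approximate mean response in model (9.39). In this case, the matrix $H(\boldsymbol{\theta})$
[see model (8.66)] is an n × p matrix whose (u, i)th element is $\partial p_r(x_u, \boldsymbol{\theta})/\partial \theta_i$,
where $x_u$ is the design setting at the uth experimental run (u = 1, 2, ..., n)
and n is the number of experimental runs. From formula (9.37) we then have

$$\frac{\partial p_r(x_u, \boldsymbol{\theta})}{\partial \theta_i} = \sum_{j=0}^{r} \frac{\partial h(z_j, \boldsymbol{\theta})}{\partial \theta_i}\, l_j(x_u), \quad i = 1, 2, \ldots, p.$$

These equations can be written as
$$\nabla p_r(x_u, \boldsymbol{\theta}) = U(\boldsymbol{\theta}) \boldsymbol{\lambda}(x_u),$$
where $\boldsymbol{\lambda}(x_u) = [l_0(x_u), l_1(x_u), \ldots, l_r(x_u)]'$ and $U(\boldsymbol{\theta})$ is the p × (r + 1) matrix
$$U(\boldsymbol{\theta}) = [\nabla h(z_0, \boldsymbol{\theta}) : \nabla h(z_1, \boldsymbol{\theta}) : \cdots : \nabla h(z_r, \boldsymbol{\theta})].$$

By assumption (c), $U(\boldsymbol{\theta})$ is of rank p. The matrix $H(\boldsymbol{\theta})$ is therefore of the
form

$$H(\boldsymbol{\theta}) = \Lambda U'(\boldsymbol{\theta}),$$
where
$$\Lambda' = [\boldsymbol{\lambda}(x_1) : \boldsymbol{\lambda}(x_2) : \cdots : \boldsymbol{\lambda}(x_n)].$$
Thus
$$H'(\boldsymbol{\theta}) H(\boldsymbol{\theta}) = U(\boldsymbol{\theta}) \Lambda' \Lambda U'(\boldsymbol{\theta}). \tag{9.40}$$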
If $n \ge r + 1$ and at least r + 1 of the design points (that is, $x_1, x_2, \ldots, x_n$) are
distinct, then $\Lambda'\Lambda$ is a nonsingular matrix. To show this, it is sufficient to
prove that $\Lambda$ is of full column rank r + 1. If not, then there must exist
constants $\alpha_0, \alpha_1, \ldots, \alpha_r$, not all equal to zero, such that

$$\sum_{i=0}^{r} \alpha_i l_i(x_u) = 0, \quad u = 1, 2, \ldots, n.$$

This indicates that the rth degree polynomial $\sum_{i=0}^{r} \alpha_i l_i(x)$, which is not
identically zero because the $l_i(x)$'s are linearly independent, has n roots,
namely, $x_1, x_2, \ldots, x_n$. This is not possible, because $n \ge r + 1$ and at least
r + 1 of the $x_u$'s (u = 1, 2, ..., n) are distinct (a polynomial of degree r has at
most r distinct roots). This contradiction implies that $\Lambda$ is of full column
rank and $\Lambda'\Lambda$ is therefore nonsingular.

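Applying the Box-Lucas design criterion to the approximating model
(9.37) amounts to finding the design settings that maximize $\det[H'(\boldsymbol{\theta})H(\boldsymbol{\theta})]$.
From formula (9.40) we have

$$\det[H'(\boldsymbol{\theta})H(\boldsymbol{\theta})] = \det[U(\boldsymbol{\theta}) \Lambda'\Lambda U'(\boldsymbol{\theta})]. \tag{9.41}$$

We note that the matrix $\Lambda'\Lambda = \sum_{u=1}^{n} \boldsymbol{\lambda}(x_u)\boldsymbol{\lambda}'(x_u)$ depends only on the design
settings. Let $\nu_{\min}(x_1, x_2, \ldots, x_n)$ and $\nu_{\max}(x_1, x_2, \ldots, x_n)$ denote, respectively,
the smallest and the largest eigenvalue of $\Lambda'\Lambda$. These eigenvalues are
positive, since $\Lambda'\Lambda$ is positive definite by the fact that $\Lambda'\Lambda$ is nonsingular, as
was shown earlier. From formula (9.41) we conclude that

$$\det[U(\boldsymbol{\theta})U'(\boldsymbol{\theta})]\, \nu_{\min}^p(x_1, x_2, \ldots, x_n) \le \det[H'(\boldsymbol{\theta})H(\boldsymbol{\theta})] \le \det[U(\boldsymbol{\theta})U'(\boldsymbol{\theta})]\, \nu_{\max}^p(x_1, x_2, \ldots, x_n). \tag{9.42}$$

This double inequality follows from the fact that the matrices
$\nu_{\max}(x_1, x_2, \ldots, x_n) U(\boldsymbol{\theta})U'(\boldsymbol{\theta}) - H'(\boldsymbol{\theta})H(\boldsymbol{\theta})$ and
$H'(\boldsymbol{\theta})H(\boldsymbol{\theta}) - \nu_{\min}(x_1, x_2, \ldots, x_n) U(\boldsymbol{\theta})U'(\boldsymbol{\theta})$ are positive semidefinite. An application of
Theorem 2.3.19(1) to these matrices results in the double inequality (9.42)
(why?). Note that the determinant of $U(\boldsymbol{\theta})U'(\boldsymbol{\theta})$ is not zero, since $U(\boldsymbol{\theta})U'(\boldsymbol{\theta})$,
which is of order p × p, is of rank p by assumption (c).

Now, from the double inequality (9.42) we deduce that there exists a
number $\gamma$, $0 \le \gamma \le 1$, such that

$$\det[H'(\boldsymbol{\theta})H(\boldsymbol{\theta})] = \left[ \gamma\, \nu_{\min}^p(x_1, x_2, \ldots, x_n) + (1 - \gamma)\, \nu_{\max}^p(x_1, x_2, \ldots, x_n) \right] \det[U(\boldsymbol{\theta})U'(\boldsymbol{\theta})].$$

If $\gamma$ is integrated out, we obtain
$$\int_0^1 \det[H'(\boldsymbol{\theta})H(\boldsymbol{\theta})]\, d\gamma = \frac{1}{2} \left[ \nu_{\min}^p(x_1, x_2, \ldots, x_n) + \nu_{\max}^p(x_1, x_2, \ldots, x_n) \right] \det[U(\boldsymbol{\theta})U'(\boldsymbol{\theta})].$$

Consequently, to construct an optimal design we can consider finding
$x_1, x_2, \ldots, x_n$ that maximize the function

$$\psi(x_1, x_2, \ldots, x_n) = \frac{1}{2} \left[ \nu_{\min}^p(x_1, x_2, \ldots, x_n) + \nu_{\max}^p(x_1, x_2, \ldots, x_n) \right]. \tag{9.43}$$

This is a modified version of the Box-Lucas criterion. Its advantage is that
the optimal design is free of $\boldsymbol{\theta}$. We therefore call such a design a
parameter-free design. The maximization of $\psi(x_1, x_2, \ldots, x_n)$ can be conveniently
carried out by using a FORTRAN program written by Conlon (1991), which is
based on Price's (1977) controlled random search procedure.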
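The criterion (9.43) itself is straightforward to evaluate for a given design. The following is a minimal sketch using NumPy, not Conlon's program: the function names `lagrange_basis` and `psi` and the node values are assumptions made for illustration, and the search over designs is left out.

```python
import numpy as np


def lagrange_basis(z, x):
    """Return [l_0(x), ..., l_r(x)] for the interpolation points z_0, ..., z_r."""
    z = np.asarray(z, dtype=float)
    vals = np.ones(len(z))
    for i in range(len(z)):
        others = np.delete(z, i)
        vals[i] = np.prod((x - others) / (z[i] - others))
    return vals


def psi(design, z, p):
    """Criterion (9.43): half the sum of the pth powers of the extreme eigenvalues."""
    lam = np.array([lagrange_basis(z, x) for x in design])  # the n x (r + 1) matrix Lambda
    eig = np.linalg.eigvalsh(lam.T @ lam)                   # eigenvalues in ascending order
    return 0.5 * (eig[0] ** p + eig[-1] ** p)


z = [0.0, 2.5, 5.0, 7.5, 10.0]   # illustrative interpolation points
print(psi(z, z, p=2))            # design placed at the interpolation points
print(psi(z + [z[0]], z, p=2))   # same design with one point replicated
```

When the design coincides with the interpolation points, $\Lambda$ is the identity matrix and the criterion equals one; replicating a point raises the largest eigenvalue of $\Lambda'\Lambda$ to two.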


EXAMPLE 9.4.1. Let us consider the nonlinear model used by Box and
Lucas (1959) of a consecutive first-order chemical reaction in which a raw
material A reacts to form a product B, which in turn decomposes to form
substance C. After time x has elapsed, the mean yield of the intermediate
product B is given by

$$h(x, \boldsymbol{\theta}) = \frac{\theta_1}{\theta_1 - \theta_2} \left( e^{-\theta_2 x} - e^{-\theta_1 x} \right),$$

where $\theta_1$ and $\theta_2$ are the rate constants for the reactions A → B and B → C,
respectively.

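Suppose that the region of interest R is the interval [0, 10]. Let the
parameter space $\Omega$ be such that $0 \le \theta_1 \le 1$, $0 \le \theta_2 \le 1$. It can be verified that

$$\frac{\partial^{r+1} h(x, \boldsymbol{\theta})}{\partial x^{r+1}} = (-1)^{r+1} \frac{\theta_1}{\theta_1 - \theta_2} \left( \theta_2^{r+1} e^{-\theta_2 x} - \theta_1^{r+1} e^{-\theta_1 x} \right).$$

Let us consider the function $w(x, \phi) = \phi^{r+1} e^{-\phi x}$. By the mean value
theorem (Theorem 4.2.2),

$$\theta_2^{r+1} e^{-\theta_2 x} - \theta_1^{r+1} e^{-\theta_1 x} = (\theta_2 - \theta_1) \frac{\partial w(x, \theta_*)}{\partial \phi},$$

where $\partial w(x, \theta_*)/\partial \phi$ is the partial derivative of $w(x, \phi)$ with respect to $\phi$
evaluated at $\theta_*$, and where $\theta_*$ is between $\theta_1$ and $\theta_2$. Thus

$$\theta_2^{r+1} e^{-\theta_2 x} - \theta_1^{r+1} e^{-\theta_1 x} = (\theta_2 - \theta_1) \left[ (r + 1)\theta_*^r e^{-\theta_* x} - \theta_*^{r+1} x e^{-\theta_* x} \right].$$
Hence,

$$\sup_{0 \le x \le 10} \left| \frac{\partial^{r+1} h(x, \boldsymbol{\theta})}{\partial x^{r+1}} \right| \le \theta_1 \theta_*^r \sup_{0 \le x \le 10} \left[ e^{-\theta_* x} |r + 1 - x\theta_*| \right] \le \sup_{0 \le x \le 10} |r + 1 - x\theta_*|.$$

However,

$$|r + 1 - x\theta_*| = \begin{cases} r + 1 - x\theta_*, & \text{if } r + 1 \ge x\theta_*, \\ -r - 1 + x\theta_*, & \text{if } r + 1 < x\theta_*. \end{cases}$$

Since $0 \le x\theta_* \le 10$, then

$$\sup_{0 \le x \le 10} |r + 1 - x\theta_*| \le \max(r + 1, 9 - r).$$

We then have
$$\sup_{0 \le x \le 10} \left| \frac{\partial^{r+1} h(x, \boldsymbol{\theta})}{\partial x^{r+1}} \right| \le \max(r + 1, 9 - r).$$

By inequality (9.36), the integer r is determined such that
$$\frac{2}{(r+1)!} \left( \frac{10}{4} \right)^{r+1} \max(r + 1, 9 - r) < \delta. \tag{9.44}$$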


If we choose $\delta = 0.053$, for example, then it can be verified that the smallest
positive integer that satisfies inequality (9.44) is r = 9. The Chebyshev points
in formula (9.18) that correspond to this value of r are given in Table 9.2. On
choosing n, the number of design points, to be equal to r + 1 = 10, where all
ten design points are distinct, the matrix $\Lambda$ in formula (9.40) will be
nonsingular of order 10 × 10. Using Conlon's (1991) FORTRAN program for
the maximization of the function $\psi$ in formula (9.43) with p = 2, it can be
shown that the maximum value of $\psi$ is 17.457. The corresponding optimal
values of $x_1, x_2, \ldots, x_{10}$ are given in Table 9.2.


Table 9.2. Chebyshev Points and Optimal Design Points for Example 9.4.1

Chebyshev Points    Optimal Design Points
9.938               9.989
9.455               9.984
8.536               9.983
7.270               9.966
5.782               9.542
4.218               7.044
2.730               6.078
1.464               4.038
0.545               1.381
0.062               0.692
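The choice r = 9 and the Chebyshev column of Table 9.2 can be reproduced with a short computation. The sketch below assumes that formula (9.18), which is not restated in this section, is the usual expression $z_i = (a + b)/2 + [(b - a)/2] \cos[(2i + 1)\pi/(2r + 2)]$; this assumption agrees with the tabulated values to three decimals. The function names are hypothetical.

```python
import math


def bound_944(r, length=10.0):
    """Left-hand side of inequality (9.44) for the region [0, 10] of Example 9.4.1."""
    return 2.0 / math.factorial(r + 1) * (length / 4.0) ** (r + 1) * max(r + 1, 9 - r)


def smallest_r(delta):
    """Smallest positive integer r with bound_944(r) < delta."""
    r = 1
    while bound_944(r) >= delta:
        r += 1
    return r


def chebyshev_points(a, b, r):
    """Assumed form of formula (9.18): r + 1 Chebyshev points on [a, b]."""
    return [(a + b) / 2 + (b - a) / 2 * math.cos((2 * i + 1) * math.pi / (2 * r + 2))
            for i in range(r + 1)]


r = smallest_r(0.053)
print(r)  # 9, as stated in the example
print([round(z, 3) for z in chebyshev_points(0.0, 10.0, r)])
```

For r = 8 the left-hand side of (9.44) is still about 0.19, so r = 9 is the first value that brings it below 0.053.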
9.4.2. Splines in Statistics


There is a broad variety of work on splines in statistics. Spline functions are
well suited to practical applications involving data that arise from the
physical world rather than the mathematical world. It is therefore only
natural that splines have many useful applications in statistics. Some of these
applications will be discussed in this section.


9.4.2.1. The Use of Cubic Splines in Regression 


Let us consider fitting the model

$$y = g(x) + \epsilon, \tag{9.45}$$

where g(x) is the mean response at x and $\epsilon$ is a random error. Suppose that
the domain of x is divided into a set of m + 1 intervals by the points
$\tau_0 < \tau_1 < \cdots < \tau_m < \tau_{m+1}$ such that on the ith interval (i = 1, 2, ..., m + 1),
g(x) is represented by the cubic spline

$$s_i(x) = a_i + b_i x + c_i x^2 + d_i x^3, \quad \tau_{i-1} \le x \le \tau_i. \tag{9.46}$$

As was seen earlier in Section 9.3.1, the parameters $a_i, b_i, c_i, d_i$ (i =
1, 2, ..., m + 1) are subject to the following continuity restrictions:

$$a_i + b_i\tau_i + c_i\tau_i^2 + d_i\tau_i^3 = a_{i+1} + b_{i+1}\tau_i + c_{i+1}\tau_i^2 + d_{i+1}\tau_i^3, \tag{9.47}$$

that is, $s_i(\tau_i) = s_{i+1}(\tau_i)$, i = 1, 2, ..., m;

$$b_i + 2c_i\tau_i + 3d_i\tau_i^2 = b_{i+1} + 2c_{i+1}\tau_i + 3d_{i+1}\tau_i^2, \tag{9.48}$$

that is, $s_i'(\tau_i) = s_{i+1}'(\tau_i)$, i = 1, 2, ..., m; and

$$2c_i + 6d_i\tau_i = 2c_{i+1} + 6d_{i+1}\tau_i, \tag{9.49}$$

that is, $s_i''(\tau_i) = s_{i+1}''(\tau_i)$, i = 1, 2, ..., m. The number of unknown parameters
in model (9.45) is therefore equal to 4(m + 1). The continuity restrictions
(9.47)-(9.49) reduce the dimensionality of the parameter space to m + 4.
However, only m + 2 parameters can be estimated. This is because the spline
method does not estimate the parameters of the $s_i$'s directly, but estimates
the ordinates of the $s_i$'s at the points $\tau_0, \tau_1, \ldots, \tau_{m+1}$, that is, $s_1(\tau_0)$ and $s_i(\tau_i)$,
i = 1, 2, ..., m + 1. Two additional restrictions are therefore needed. These
are chosen to be of the form (see Poirier, 1973, page 516; Buse and Lim,
1977, page 64):

$$s_1''(\tau_0) = \pi_0 s_1''(\tau_1),$$
or
$$2c_1 + 6d_1\tau_0 = \pi_0 (2c_1 + 6d_1\tau_1), \tag{9.50}$$
and
$$s_{m+1}''(\tau_{m+1}) = \pi_{m+1} s_{m+1}''(\tau_m),$$
or
$$2c_{m+1} + 6d_{m+1}\tau_{m+1} = \pi_{m+1} (2c_{m+1} + 6d_{m+1}\tau_m), \tag{9.51}$$

where $\pi_0$ and $\pi_{m+1}$ are known.

Let $y_1, y_2, \ldots, y_n$ be n observations on the response y, where $n \ge m + 2$,
such that $n_i$ observations are taken in the ith interval $[\tau_{i-1}, \tau_i]$, i = 1, 2, ..., m + 1.
Thus $n = \sum_{i=1}^{m+1} n_i$. If $y_{i1}, y_{i2}, \ldots, y_{in_i}$ are the observations in the ith
interval (i = 1, 2, ..., m + 1), then from model (9.45) we have

$$y_{ij} = g(x_{ij}) + \epsilon_{ij}, \quad i = 1, 2, \ldots, m + 1; \; j = 1, 2, \ldots, n_i, \tag{9.52}$$

where $x_{ij}$ is the setting of x for which $y = y_{ij}$, and the $\epsilon_{ij}$'s are distributed
independently with means equal to zero and a common variance $\sigma^2$. The
estimation of the parameters of model (9.45) is then reduced to a restricted
least-squares problem with formulas (9.47)-(9.51) representing linear
restrictions on the 4(m + 1) parameters of the model [see, for example, Searle
(1971, Section 3.6), for a discussion concerning least-squares estimation
under linear restrictions on the fitted model's parameters]. Using matrix
notation, model (9.52) and the linear restrictions (9.47)-(9.51) can be
expressed as

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}, \tag{9.53}$$
$$C\boldsymbol{\beta} = \boldsymbol{\delta}, \tag{9.54}$$

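where $\mathbf{y} = (\mathbf{y}_1' : \mathbf{y}_2' : \cdots : \mathbf{y}_{m+1}')'$ with $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{in_i})'$, i = 1, 2, ..., m + 1;
$X = \mathrm{Diag}(X_1, X_2, \ldots, X_{m+1})$ is a block-diagonal matrix of order n × [4(m + 1)]
with $X_i$ being a matrix of order $n_i \times 4$ whose jth row is of the form
$(1, x_{ij}, x_{ij}^2, x_{ij}^3)$, $j = 1, 2, \ldots, n_i$; i = 1, 2, ..., m + 1; $\boldsymbol{\beta} = (\boldsymbol{\beta}_1' : \boldsymbol{\beta}_2' : \cdots : \boldsymbol{\beta}_{m+1}')'$
with $\boldsymbol{\beta}_i = (a_i, b_i, c_i, d_i)'$, i = 1, 2, ..., m + 1; and $\boldsymbol{\epsilon} = (\boldsymbol{\epsilon}_1' : \boldsymbol{\epsilon}_2' : \cdots : \boldsymbol{\epsilon}_{m+1}')'$,
where $\boldsymbol{\epsilon}_i$ is the vector of random errors associated with the observations in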
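the ith interval, i = 1, 2, ..., m + 1. Furthermore, $C = [C_0' : C_1' : C_2' : C_3']'$,
where, for l = 0, 1, 2,

$$C_l = \begin{bmatrix} -\mathbf{e}_{1l}' & \mathbf{e}_{1l}' & \mathbf{0}' & \cdots & \mathbf{0}' & \mathbf{0}' \\ \mathbf{0}' & -\mathbf{e}_{2l}' & \mathbf{e}_{2l}' & \cdots & \mathbf{0}' & \mathbf{0}' \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ \mathbf{0}' & \mathbf{0}' & \mathbf{0}' & \cdots & -\mathbf{e}_{ml}' & \mathbf{e}_{ml}' \end{bmatrix}$$

is a matrix of order m × [4(m + 1)] such that $\mathbf{e}_{i0}' = (1, \tau_i, \tau_i^2, \tau_i^3)$, $\mathbf{e}_{i1}' =
(0, 1, 2\tau_i, 3\tau_i^2)$, $\mathbf{e}_{i2}' = (0, 0, 2, 6\tau_i)$, i = 1, 2, ..., m, and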


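$$C_3 = \begin{bmatrix} 0 & 0 & 2(\pi_0 - 1) & 6(\pi_0\tau_1 - \tau_0) & 0 & \cdots & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & 2(\pi_{m+1} - 1) & 6(\pi_{m+1}\tau_m - \tau_{m+1}) \end{bmatrix}$$

is a 2 × [4(m + 1)] matrix. Finally, $\boldsymbol{\delta} = (\boldsymbol{\delta}_0' : \boldsymbol{\delta}_1' : \boldsymbol{\delta}_2' : \boldsymbol{\delta}_3')' = \mathbf{0}$, where the
partitioning of $\boldsymbol{\delta}$ into $\boldsymbol{\delta}_0$, $\boldsymbol{\delta}_1$, $\boldsymbol{\delta}_2$, and $\boldsymbol{\delta}_3$ conforms to that of C.
Consequently, and on the basis of formula (103) in Searle (1971, page 113), the
least-squares estimator of $\boldsymbol{\beta}$ for model (9.53) under the restriction described
by formula (9.54) is given by

$$\hat{\boldsymbol{\beta}}_r = \hat{\boldsymbol{\beta}} - (X'X)^{-1} C' \left[ C (X'X)^{-1} C' \right]^{-1} (C\hat{\boldsymbol{\beta}} - \boldsymbol{\delta})$$
$$\phantom{\hat{\boldsymbol{\beta}}_r} = \hat{\boldsymbol{\beta}} - (X'X)^{-1} C' \left[ C (X'X)^{-1} C' \right]^{-1} C\hat{\boldsymbol{\beta}},$$

where $\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{y}$ is the ordinary least-squares estimator of $\boldsymbol{\beta}$.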
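The restricted estimator can be sketched with NumPy for the simplest case of a single interior knot (m = 1). Everything named here is an assumption made for illustration: the helper names, the data, the choice $\pi_0 = \pi_2 = 0$, and the sign convention of the continuity rows (which is immaterial because $\boldsymbol{\delta} = \mathbf{0}$). It is not the Buse and Lim (1977) implementation.

```python
import numpy as np


def restricted_ls(X, y, C, delta=None):
    """beta_r = beta - (X'X)^{-1} C' [C (X'X)^{-1} C']^{-1} (C beta - delta)."""
    if delta is None:
        delta = np.zeros(C.shape[0])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                         # ordinary least squares
    adj = np.linalg.solve(C @ xtx_inv @ C.T, C @ beta - delta)
    return beta - xtx_inv @ C.T @ adj


def one_knot_design(x, tau1):
    """X with rows (1, x, x^2, x^3) placed in the block of the interval containing x."""
    x = np.asarray(x, dtype=float)
    P = np.vander(x, 4, increasing=True)
    left = (x <= tau1)[:, None]
    return np.hstack([P * left, P * ~left])


def one_knot_constraints(tau0, tau1, tau2, pi0=0.0, pi2=0.0):
    """Rows of C for m = 1: restrictions (9.47)-(9.49) at tau1, then (9.50) and (9.51)."""
    e0 = np.array([1.0, tau1, tau1**2, tau1**3])
    e1 = np.array([0.0, 1.0, 2 * tau1, 3 * tau1**2])
    e2 = np.array([0.0, 0.0, 2.0, 6 * tau1])
    rows = [np.concatenate([-e, e]) for e in (e0, e1, e2)]
    rows.append(np.array([0, 0, 2 * (pi0 - 1), 6 * (pi0 * tau1 - tau0), 0, 0, 0, 0], dtype=float))
    rows.append(np.array([0, 0, 0, 0, 0, 0, 2 * (pi2 - 1), 6 * (pi2 * tau1 - tau2)], dtype=float))
    return np.vstack(rows)


x = np.linspace(0.0, 2.0, 21)               # illustrative design settings
y = np.sin(x) + 0.01 * np.cos(7 * x)        # illustrative responses
X = one_knot_design(x, tau1=1.0)
C = one_knot_constraints(0.0, 1.0, 2.0)
beta_r = restricted_ls(X, y, C)
print(np.round(beta_r, 4))
print(np.abs(C @ beta_r).max())             # restrictions hold up to rounding error
```

Each interval needs at least four distinct settings of x for $X'X$ to be invertible, which is why the formula is applied here with well over four points per piece.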

This estimation procedure, which was developed by Buse and Lim (1977), 
demonstrates that the fitting of a cubic spline regression model can be 
reduced to a restricted least-squares problem. Buse and Lim presented a 
numerical example based on Indianapolis 500 race data over the period 
(1911-1971) to illustrate the implementation of their procedure. 

Other papers of interest in the area of regression splines include those of 
Poirier (1973) and Gallant and Fuller (1973). The paper by Poirier discusses 
the basic theory of cubic regression splines from an economic point of view. 
In the paper by Gallant and Fuller, the knots are treated as unknown 
parameters rather than being fixed. Thus in their procedure, the knots must 
be estimated, which causes the estimation process to become nonlinear. 


9.4.2.2. Designs for Fitting Spline Models 


A number of papers have addressed the problem of finding a design to
estimate the parameters of model (9.45), where g(x) is represented by a
spline function. We shall make a brief reference to some of these papers.
Agarwal and Studden (1978) considered a representation of g(x) over
$0 \le x \le 1$ by a linear spline s(x), which has the form given by (9.21). Here,
$g''(x)$ is assumed to be continuous. If we recall, the $\theta_i$ coefficients in formula
(9.21) are the values of s at $\tau_0, \tau_1, \ldots, \tau_{m+1}$.

Let $x_1, x_2, \ldots, x_r$ be r design points in [0, 1]. Let $\bar{y}_i$ denote the average of
$n_i$ observations taken at $x_i$ (i = 1, 2, ..., r). The vector $\boldsymbol{\theta} = (\theta_0, \theta_1, \ldots, \theta_{m+1})'$
can therefore be estimated by
$$\hat{\boldsymbol{\theta}} = A\bar{\mathbf{y}}, \tag{9.55}$$
where $\bar{\mathbf{y}} = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_r)'$ and A is an (m + 2) × r matrix. Hence, an estimate
of g(x) is given by
$$\hat{g}(x) = \mathbf{l}'(x)\hat{\boldsymbol{\theta}} = \mathbf{l}'(x) A \bar{\mathbf{y}}, \tag{9.56}$$

where $\mathbf{l}(x) = [l_0(x), l_1(x), \ldots, l_{m+1}(x)]'$.

Now, $E(\hat{\boldsymbol{\theta}}) = A\mathbf{g}_r$, where $\mathbf{g}_r = [g(x_1), g(x_2), \ldots, g(x_r)]'$. Thus $E[\hat{g}(x)] =
\mathbf{l}'(x) A \mathbf{g}_r$, and the variance of $\hat{g}(x)$ is

$$\mathrm{Var}[\hat{g}(x)] = E\left[ \mathbf{l}'(x) A \bar{\mathbf{y}} - \mathbf{l}'(x) A \mathbf{g}_r \right]^2 = \frac{\sigma^2}{n}\, \mathbf{l}'(x) A D^{-1} A' \mathbf{l}(x),$$

where D is an r × r diagonal matrix with diagonal elements
$n_1/n, n_2/n, \ldots, n_r/n$. The mean squared error of $\hat{g}(x)$ is the variance plus
the squared bias of $\hat{g}(x)$. It follows that the integrated mean squared error
(IMSE) of $\hat{g}(x)$ (see Section 8.4.3) is

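$$J = \frac{n\omega}{\sigma^2} \int_0^1 \mathrm{Var}[\hat{g}(x)]\, dx + \frac{n\omega}{\sigma^2} \int_0^1 \mathrm{Bias}^2[\hat{g}(x)]\, dx = V + B,$$

where $\omega = \left( \int_0^1 dx \right)^{-1} = 1$, and

$$\mathrm{Bias}^2[\hat{g}(x)] = \left[ g(x) - \mathbf{l}'(x) A \mathbf{g}_r \right]^2.$$
Thus

$$J = \int_0^1 \mathbf{l}'(x) A D^{-1} A' \mathbf{l}(x)\, dx + \frac{n}{\sigma^2} \int_0^1 \left[ g(x) - \mathbf{l}'(x) A \mathbf{g}_r \right]^2 dx$$
$$\phantom{J} = \mathrm{tr}(A D^{-1} A' M) + \frac{n}{\sigma^2} \int_0^1 \left[ g(x) - \mathbf{l}'(x) A \mathbf{g}_r \right]^2 dx,$$
where $M = \int_0^1 \mathbf{l}(x)\mathbf{l}'(x)\, dx$.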

Agarwal and Studden (1978) proposed to minimize J with respect to (i)
the design (that is, $x_1, x_2, \ldots, x_r$ as well as $n_1, n_2, \ldots, n_r$), (ii) the matrix A,
and (iii) the knots $\tau_1, \tau_2, \ldots, \tau_m$, assuming that g is known.

Park (1978) adopted the D-optimality criterion (see Section 8.5) for the 
choice of design when g(x) is represented by a spline of the form given by 
formula (9.22) with only one intermediate knot. 

Draper, Guttman, and Lipow (1977) extended the design criterion based 
on the minimization of the average squared bias B (see Section 8.4.3) to 
situations involving spline models. In particular, they considered fitting 
first-order or second-order models when the true mean response is of the 
second order or the third order, respectively. 


9.4.2.3. Other Applications of Splines in Statistics 


Spline functions have many other useful applications in both theoretical and 
applied statistical research. For example, splines are used in nonparametric 
regression and data smoothing, nonparametric density estimation, and time
series analysis. They are also utilized in the analysis of response curves in 
agriculture and economics. The review articles by Wegman and Wright (1983) 
and Ramsay (1988) contain many references on the various uses of splines in 
statistics (see also the article by Smith, 1979). An overview of the role of 
splines in regression analysis is given in Eubank (1984). 
EXERCISES


In Mathematics


9.1. Let f(x) be a function with a continuous derivative on [0, 1], and let
$b_n(x)$ be the nth degree Bernstein approximating polynomial of f.
Then, for some constant c and for all n,

$$\sup_{0 \le x \le 1} |f(x) - b_n(x)| \le \frac{c}{n^{1/2}}.$$

9.2. Prove Lemma 9.1.1.

9.3. Prove Lemma 9.1.2.

9.4. Show that for every interval [-a, a] there is a sequence of polynomials
$p_n(x)$ such that $p_n(0) = 0$ and $\lim_{n \to \infty} p_n(x) = |x|$ uniformly on [-a, a].

9.5. Suppose that f(x) is continuous on [0, 1] and that

$$\int_0^1 f(x) x^n\, dx = 0, \quad n = 0, 1, 2, \ldots.$$

Show that f(x) = 0 on [0, 1].
[Hint: $\int_0^1 f(x) p_n(x)\, dx = 0$, where $p_n(x)$ is any polynomial of degree n.]

9.6. Suppose that the function f(x) has n + 1 continuous derivatives on
[a, b]. Let $a = a_0 < a_1 < \cdots < a_n = b$ be n + 1 points in [a, b]. Then

$$\sup_{a \le x \le b} |f(x) - p(x)| \le \frac{\tau_{n+1} h^{n+1}}{4(n + 1)},$$

where p(x) is the Lagrange polynomial defined by formula (9.14),
$\tau_{n+1} = \sup_{a \le x \le b} |f^{(n+1)}(x)|$, and $h = \max(a_{i+1} - a_i)$, i = 0, 1, ..., n - 1.
[Hint: Show that $\left| \prod_{i=0}^{n} (x - a_i) \right| \le n!\,(h^{n+1}/4)$.]

9.7. Apply Lagrange interpolation to approximate the function f(x) = log x
over the interval [3.50, 3.80] using $a_0 = 3.50$, $a_1 = 3.60$, $a_2 = 3.70$, and
$a_3 = 3.80$ as interpolation points. Compute an upper bound on the
error of approximation.

9.8. Let $a = \tau_0 < \tau_1 < \cdots < \tau_n = b$ be a partition of [a, b]. Suppose that
f(x) has continuous derivatives up to order 2 over [a, b]. Consider a
cubic spline s(x) that satisfies

$$s(\tau_i) = f(\tau_i), \quad i = 0, 1, \ldots, n,$$
$$s'(a) = f'(a),$$
$$s'(b) = f'(b).$$

Show that

$$\int_a^b [f''(x)]^2\, dx \ge \int_a^b [s''(x)]^2\, dx.$$

9.9. Determine the cubic spline approximation of the function f(x) =
$\cos(2\pi x)$ over the interval $[0, \pi]$ using five evenly spaced knots. Give an
upper bound on the error of approximation.


In Statistics


9.10. Consider the nonlinear model
$$y(x) = h(x, \boldsymbol{\theta}) + \epsilon,$$
where
$$h(x, \boldsymbol{\theta}) = \theta_1 \left( 1 - \theta_2 e^{-\theta_3 x} \right),$$
such that $0 \le \theta_1 \le 50$, $0 \le \theta_2 \le 1$, $0 \le \theta_3 \le 1$. Obtain a Lagrange
interpolating polynomial that approximates the mean response function
$h(x, \boldsymbol{\theta})$ over the region [0, 8] with an error not exceeding $\delta = 0.05$.

9.11. Consider the nonlinear model
$$y = \alpha + (0.49 - \alpha) \exp[-\beta(x - 8)] + \epsilon,$$
where $\epsilon$ is a random error with a zero mean and a variance $\sigma^2$.
Suppose that the region of interest is the interval [10, 40], and that the
parameter space $\Omega$ is such that $0.36 \le \alpha \le 0.41$, $0.06 \le \beta \le 0.16$. Let
s(x) be the cubic spline that approximates the mean response, that is,
$\eta(x, \alpha, \beta) = \alpha + (0.49 - \alpha) \exp[-\beta(x - 8)]$, over [10, 40]. Determine
the number of knots needed so that

$$\max_{10 \le x \le 40} |\eta(x, \alpha, \beta) - s(x)| < 0.001$$

for all $(\alpha, \beta) \in \Omega$.

9.12. Consider fitting the spline model

$$y = \beta_0 + \beta_1 x + \beta_2 (x - \alpha)_+^2 + \epsilon$$

over the interval [-1, 1], where $\alpha$ is a known constant, $-1 \le \alpha \le 1$. A
three-point design consisting of $x_1, x_2, x_3$ with $-1 \le x_1 < \alpha \le x_2 < x_3 \le 1$
is used to fit the model. Using matrix notation, the model is written as

$$\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon},$$
where X is the matrix
$$X = \begin{bmatrix} 1 & x_1 & 0 \\ 1 & x_2 & (x_2 - \alpha)^2 \\ 1 & x_3 & (x_3 - \alpha)^2 \end{bmatrix},$$
and $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)'$. Determine $x_1, x_2, x_3$ so that the design is
D-optimal, that is, it maximizes the determinant of $X'X$.
[Note: See Park (1978).]


CHAPTER 10 


Orthogonal Polynomials 


The subject of orthogonal polynomials can be traced back to the work of the 
French mathematician Adrien-Marie Legendre (1752-1833) on planetary 
motion. These polynomials have important applications in physics, quantum 
mechanics, mathematical statistics, and other areas in mathematics. 

This chapter provides an exposition of the properties of orthogonal 
polynomials. Emphasis will be placed on Legendre, Chebyshev, Jacobi, 
Laguerre, and Hermite polynomials. In addition, applications of these poly- 
nomials in statistics will be discussed in Section 10.9. 


10.1. INTRODUCTION 


Suppose that f(x) and g(x) are two continuous functions on [a, b]. Let w(x)
be a positive function that is Riemann integrable on [a, b]. The dot product
of f(x) and g(x) with respect to w(x), which is denoted by $(f \cdot g)_\omega$, is defined
as

$$(f \cdot g)_\omega = \int_a^b f(x) g(x) w(x)\, dx.$$

The norm of f(x) with respect to w(x), denoted by $\|f\|_\omega$, is defined as
$\|f\|_\omega = \left[ \int_a^b f^2(x) w(x)\, dx \right]^{1/2}$. The functions f(x) and g(x) are said to be
orthogonal [with respect to w(x)] if $(f \cdot g)_\omega = 0$. Furthermore, a sequence
$\{f_n(x)\}_{n=0}^{\infty}$ of continuous functions defined on [a, b] is said to be orthogonal
with respect to w(x) if $(f_m \cdot f_n)_\omega = 0$ for $m \ne n$. If, in addition, $\|f_n\|_\omega = 1$ for
all n, then the functions $f_n(x)$, n = 0, 1, 2, ..., are called orthonormal. In
particular, if $S = \{p_n(x)\}_{n=0}^{\infty}$ is a sequence of polynomials such that $(p_n \cdot p_m)_\omega = 0$
for all $m \ne n$, then S forms a sequence of polynomials orthogonal with
respect to w(x).

A sequence of orthogonal polynomials can be constructed on the basis of 
the following theorem: 
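Theorem 10.1.1. The polynomials $\{p_n(x)\}_{n=0}^{\infty}$ which are defined according
to the following recurrence relation are orthogonal:

$$p_0(x) = 1,$$
$$p_1(x) = x - \frac{(x p_0 \cdot p_0)_\omega}{\|p_0\|_\omega^2} = x - \frac{(x \cdot 1)_\omega}{\|1\|_\omega^2},$$
$$p_n(x) = (x - a_n) p_{n-1}(x) - b_n p_{n-2}(x), \quad n = 2, 3, \ldots, \tag{10.1}$$

where
$$a_n = \frac{(x p_{n-1} \cdot p_{n-1})_\omega}{\|p_{n-1}\|_\omega^2}, \tag{10.2}$$
$$b_n = \frac{(x p_{n-1} \cdot p_{n-2})_\omega}{\|p_{n-2}\|_\omega^2}. \tag{10.3}$$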

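Proof. We show by mathematical induction on n that $(p_n \cdot p_i)_\omega = 0$ for
i < n. For n = 1,

$$(p_1 \cdot p_0)_\omega = \int_a^b \left[ x - \frac{(x \cdot 1)_\omega}{\|1\|_\omega^2} \right] w(x)\, dx = (x \cdot 1)_\omega - (x \cdot 1)_\omega \frac{\|1\|_\omega^2}{\|1\|_\omega^2} = 0.$$

Now, suppose that the assertion is true for n - 1 ($n \ge 2$). To show that it is
true for n, we have that

$$(p_n \cdot p_i)_\omega = \int_a^b \left[ (x - a_n) p_{n-1}(x) - b_n p_{n-2}(x) \right] p_i(x) w(x)\, dx$$
$$\phantom{(p_n \cdot p_i)_\omega} = (x p_{n-1} \cdot p_i)_\omega - a_n (p_{n-1} \cdot p_i)_\omega - b_n (p_{n-2} \cdot p_i)_\omega.$$
Thus, for i = n - 1,

$$(p_n \cdot p_i)_\omega = (x p_{n-1} \cdot p_{n-1})_\omega - a_n \|p_{n-1}\|_\omega^2 - b_n (p_{n-2} \cdot p_{n-1})_\omega = 0,$$

by the definition of $a_n$ in (10.2) and the fact that $(p_{n-2} \cdot p_{n-1})_\omega = (p_{n-1} \cdot p_{n-2})_\omega = 0$.
Similarly, for i = n - 2,

$$(p_n \cdot p_i)_\omega = (x p_{n-1} \cdot p_{n-2})_\omega - a_n (p_{n-1} \cdot p_{n-2})_\omega - b_n (p_{n-2} \cdot p_{n-2})_\omega$$
$$\phantom{(p_n \cdot p_i)_\omega} = (x p_{n-1} \cdot p_{n-2})_\omega - b_n \|p_{n-2}\|_\omega^2 = 0, \quad \text{by (10.3)}.$$

Finally, for i < n - 2, we have
$$(p_n \cdot p_i)_\omega = (x p_{n-1} \cdot p_i)_\omega - a_n (p_{n-1} \cdot p_i)_\omega - b_n (p_{n-2} \cdot p_i)_\omega = \int_a^b x\, p_{n-1}(x) p_i(x) w(x)\, dx. \tag{10.4}$$
But, from the recurrence relation,
$$p_{i+1}(x) = (x - a_{i+1}) p_i(x) - b_{i+1} p_{i-1}(x),$$
that is,
$$x p_i(x) = p_{i+1}(x) + a_{i+1} p_i(x) + b_{i+1} p_{i-1}(x).$$

It follows that
$$\int_a^b x\, p_{n-1}(x) p_i(x) w(x)\, dx = \int_a^b p_{n-1}(x) \left[ p_{i+1}(x) + a_{i+1} p_i(x) + b_{i+1} p_{i-1}(x) \right] w(x)\, dx$$
$$= (p_{n-1} \cdot p_{i+1})_\omega + a_{i+1} (p_{n-1} \cdot p_i)_\omega + b_{i+1} (p_{n-1} \cdot p_{i-1})_\omega = 0.$$

Hence, by (10.4), $(p_n \cdot p_i)_\omega = 0$. ∎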
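The recurrence (10.1)-(10.3) can be carried out exactly for a simple weight function. The sketch below takes w(x) = 1 on [-1, 1] and uses rational arithmetic; polynomials are stored as lists of coefficients in ascending powers. The function names and the choice of weight and interval are assumptions made for illustration.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Product of two polynomials given by ascending coefficient lists."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def dot(p, q, a=-1, b=1):
    """(p . q) for w(x) = 1 on [a, b], by exact term-by-term integration."""
    return sum(c * (Fraction(b) ** (k + 1) - Fraction(a) ** (k + 1)) / (k + 1)
               for k, c in enumerate(poly_mul(p, q)))


def orthogonal_polys(n_max):
    """p_0, ..., p_{n_max} from the recurrence (10.1) with a_n, b_n as in (10.2), (10.3)."""
    x = [Fraction(0), Fraction(1)]
    p = [[Fraction(1)]]
    a1 = dot(x, p[0]) / dot(p[0], p[0])
    p.append([-a1, Fraction(1)])
    for n in range(2, n_max + 1):
        xp = poly_mul(x, p[n - 1])
        a_n = dot(xp, p[n - 1]) / dot(p[n - 1], p[n - 1])
        b_n = dot(xp, p[n - 2]) / dot(p[n - 2], p[n - 2])
        new = poly_mul([-a_n, Fraction(1)], p[n - 1])   # (x - a_n) p_{n-1}
        for k, c in enumerate(p[n - 2]):
            new[k] -= b_n * c                           # minus b_n p_{n-2}
        p.append(new)
    return p[: n_max + 1]


polys = orthogonal_polys(5)
print(polys[2])   # coefficients of x^2 - 1/3
print(polys[3])   # coefficients of x^3 - (3/5) x
```

With this weight the resulting polynomials are the Legendre polynomials rescaled to have leading coefficient one, and every pairwise dot product vanishes exactly.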


It is easy to see from the recurrence relation (10.1) that $p_n(x)$ is of degree
n, and the coefficient of $x^n$ is equal to one. Furthermore, we have the
following corollaries:


Corollary 10.1.1. An arbitrary polynomial of degree $\le n$ is uniquely
expressible as a linear combination of $p_0(x), p_1(x), \ldots, p_n(x)$.


Corollary 10.1.2. The coefficient of $x^{n-1}$ in $p_n(x)$ is $-\sum_{i=1}^{n} a_i$ ($n \ge 1$).


Proof. If $d_n$ denotes the coefficient of $x^{n-1}$ in $p_n(x)$ ($n \ge 2$), then by
comparing the coefficients of $x^{n-1}$ on both sides of the recurrence relation
(10.1), we obtain
$$d_n = d_{n-1} - a_n, \quad n = 2, 3, \ldots. \tag{10.5}$$
The result follows from (10.5) and by noting that $d_1 = -a_1$. ∎

Another property of orthogonal polynomials is given by the following 
theorem: 


Theorem 10.1.2. If $\{p_n(x)\}_{n=0}^{\infty}$ is a sequence of orthogonal polynomials
with respect to w(x) on [a, b], then the zeros of $p_n(x)$ ($n \ge 1$) are all real,
distinct, and located in the interior of [a, b].


Proof. Since $(p_n \cdot p_0)_\omega = 0$ for $n \ge 1$, then $\int_a^b p_n(x) w(x)\, dx = 0$. This
indicates that $p_n(x)$ must change sign at least once in (a, b) [recall that w(x) is
positive]. Suppose that $p_n(x)$ changes sign between a and b at just k points,
denoted by $x_1, x_2, \ldots, x_k$. Let $g(x) = (x - x_1)(x - x_2) \cdots (x - x_k)$. Then,
$p_n(x) g(x)$ is a polynomial with no zeros of odd multiplicity in (a, b). Hence,
$\int_a^b p_n(x) g(x) w(x)\, dx \ne 0$, that is, $(p_n \cdot g)_\omega \ne 0$. If k < n, then we have a
contradiction by the fact that $p_n$ is orthogonal to g(x) [g(x), being a polynomial of
degree k, can be expressed as a linear combination of $p_0(x), p_1(x), \ldots, p_k(x)$
by Corollary 10.1.1]. Consequently, k = n, and $p_n(x)$ has n distinct zeros in
the interior of [a, b]. ∎


Particular orthogonal polynomials can be derived depending on the choice 
of the interval [a,b], and the weight function w(x). For example, the 
well-known orthogonal polynomials listed below are obtained by the follow- 
ing selections of [a, b] and w(x): 


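Orthogonal Polynomial            a            b            w(x)

Legendre                         $-1$         $1$          $1$
Jacobi                           $-1$         $1$          $(1 - x)^\alpha (1 + x)^\beta$, $\alpha, \beta > -1$
Chebyshev of the first kind      $-1$         $1$          $(1 - x^2)^{-1/2}$
Chebyshev of the second kind     $-1$         $1$          $(1 - x^2)^{1/2}$
Hermite                          $-\infty$    $\infty$     $e^{-x^2/2}$
Laguerre                         $0$          $\infty$     $e^{-x} x^\alpha$, $\alpha > -1$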

These polynomials are called classical orthogonal polynomials. We shall 
study their properties and methods of derivation. 


10.2. LEGENDRE POLYNOMIALS 


These polynomials are derived by applying the so-called Rodrigues formula

$$p_n(x) = \frac{1}{2^n n!} \frac{d^n (x^2 - 1)^n}{dx^n}, \qquad n = 0, 1, 2, \ldots.$$

Thus, for $n = 0, 1, 2, 3, 4$, for example, we have

$$p_0(x) = 1,$$
$$p_1(x) = x,$$
$$p_2(x) = \tfrac{3}{2}x^2 - \tfrac{1}{2},$$
$$p_3(x) = \tfrac{5}{2}x^3 - \tfrac{3}{2}x,$$
$$p_4(x) = \tfrac{35}{8}x^4 - \tfrac{30}{8}x^2 + \tfrac{3}{8}.$$

From the Rodrigues formula it follows that $p_n(x)$ is a polynomial of degree $n$
and the coefficient of $x^n$ is $\binom{2n}{n}/2^n$. We can multiply $p_n(x)$ by $2^n/\binom{2n}{n}$ to
make the coefficient of $x^n$ equal to one ($n = 1, 2, \ldots$).


Another definition of Legendre polynomials is obtained by means of the
generating function,

$$g(x, r) = \frac{1}{(1 - 2xr + r^2)^{1/2}},$$

by expanding it as a power series in $r$ for sufficiently small values of $r$. The
coefficient of $r^n$ in this expansion is $p_n(x)$, $n = 0, 1, \ldots$, that is,

$$g(x, r) = \sum_{n=0}^{\infty} p_n(x) r^n.$$

To demonstrate this, let us consider expanding $(1 - z)^{-1/2}$ in a neighborhood
of zero, where $z = 2xr - r^2$:

$$\frac{1}{(1 - z)^{1/2}} = 1 + \tfrac{1}{2}z + \tfrac{3}{8}z^2 + \tfrac{5}{16}z^3 + \cdots, \qquad |z| < 1,$$
$$= 1 + \tfrac{1}{2}(2xr - r^2) + \tfrac{3}{8}(2xr - r^2)^2 + \tfrac{5}{16}(2xr - r^2)^3 + \cdots$$
$$= 1 + xr + \left(\tfrac{3}{2}x^2 - \tfrac{1}{2}\right)r^2 + \left(\tfrac{5}{2}x^3 - \tfrac{3}{2}x\right)r^3 + \cdots.$$

We note that the coefficients of $1$, $r$, $r^2$, and $r^3$ are the same as $p_0(x)$, $p_1(x)$,
$p_2(x)$, $p_3(x)$, as was seen earlier. In general, it is easy to see that the
coefficient of $r^n$ is $p_n(x)$ ($n = 0, 1, 2, \ldots$).

By differentiating $g(x, r)$ with respect to $r$, it can be seen that

$$(1 - 2xr + r^2)\frac{\partial g(x, r)}{\partial r} - (x - r)\,g(x, r) = 0.$$

By substituting $g(x, r) = \sum_{n=0}^{\infty} p_n(x) r^n$ in this equation, we obtain

$$(1 - 2xr + r^2)\sum_{n=1}^{\infty} n\,p_n(x) r^{n-1} - (x - r)\sum_{n=0}^{\infty} p_n(x) r^n = 0.$$

The coefficient of $r^n$ must be zero for each $n$ and for all values of $x$
($n = 1, 2, \ldots$). We thus have the following identity:

$$(n + 1)p_{n+1}(x) - (2n + 1)x\,p_n(x) + n\,p_{n-1}(x) = 0, \qquad n = 1, 2, \ldots. \tag{10.6}$$

This is a recurrence relation that connects any three successive Legendre
polynomials. For example, for $p_2(x) = \tfrac{3}{2}x^2 - \tfrac{1}{2}$, $p_3(x) = \tfrac{5}{2}x^3 - \tfrac{3}{2}x$, we find
from (10.6) that

$$p_4(x) = \tfrac{1}{4}\left[7x\,p_3(x) - 3p_2(x)\right] = \tfrac{35}{8}x^4 - \tfrac{30}{8}x^2 + \tfrac{3}{8}.$$
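
The recurrence (10.6) lends itself to a short computation. The following is a minimal sketch, assuming polynomials are stored as coefficient lists with the constant term first; the function name `legendre` is an illustrative choice and not part of the text. It rebuilds $p_0(x), \ldots, p_4(x)$ in exact rational arithmetic so the result can be compared with the example above.

```python
from fractions import Fraction


def legendre(n_max):
    """Coefficient lists (constant term first) of p_0, ..., p_{n_max},
    generated from (n+1) p_{n+1} = (2n+1) x p_n - n p_{n-1}, i.e. (10.6)."""
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]  # p_0 = 1, p_1 = x
    for n in range(1, n_max):
        p_n, p_prev = polys[n], polys[n - 1]
        x_pn = [Fraction(0)] + p_n  # multiplying by x shifts every coefficient up
        nxt = []
        for k in range(n + 2):
            prev_k = p_prev[k] if k < len(p_prev) else Fraction(0)
            nxt.append(((2 * n + 1) * x_pn[k] - n * prev_k) / (n + 1))
        polys.append(nxt)
    return polys[: n_max + 1]


p = legendre(4)
print(p[4])  # coefficients of 1, x, x^2, x^3, x^4 in p_4(x)
```

Exact fractions are used here only so that the coefficients can be matched against the displayed formulas without rounding.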


10.2.1. Expansion of a Function Using Legendre Polynomials 


Suppose that $f(x)$ is a function defined on $[-1, 1]$ such that $\int_{-1}^{1} f(x) p_n(x)\,dx$
exists for $n = 0, 1, 2, \ldots$. Consider the series expansion

$$f(x) = \sum_{i=0}^{\infty} a_i p_i(x). \tag{10.7}$$

Multiplying both sides of (10.7) by $p_n(x)$ and then integrating from $-1$ to $1$,
we obtain, by the orthogonality of Legendre polynomials,

$$a_n = \left[\int_{-1}^{1} p_n^2(x)\,dx\right]^{-1}\int_{-1}^{1} f(x) p_n(x)\,dx, \qquad n = 0, 1, \ldots.$$

It can be shown that (see Jackson, 1941, page 52)

$$\int_{-1}^{1} p_n^2(x)\,dx = \frac{2}{2n + 1}, \qquad n = 0, 1, 2, \ldots.$$

Hence, the coefficient of $p_n(x)$ in (10.7) is given by

$$a_n = \frac{2n + 1}{2}\int_{-1}^{1} f(x) p_n(x)\,dx. \tag{10.8}$$

If $s_n(x)$ denotes the partial sum $\sum_{i=0}^{n} a_i p_i(x)$ of the series in (10.7), then

$$s_n(x) = \frac{n + 1}{2}\int_{-1}^{1} \frac{f(t)}{t - x}\left[p_{n+1}(t) p_n(x) - p_n(t) p_{n+1}(x)\right]dt, \qquad n = 0, 1, 2, \ldots. \tag{10.9}$$

This is known as Christoffel's identity (see Jackson, 1941, page 55). If $f(x)$ is
continuous on $[-1, 1]$ and has a derivative at $x = x_0$, then $\lim_{n \to \infty} s_n(x_0) =
f(x_0)$, and hence the series in (10.7) converges at $x_0$ to the value $f(x_0)$ (see
Jackson, 1941, pages 64-65).


10.3. JACOBI POLYNOMIALS 


Jacobi polynomials, named after the German mathematician Karl Gustav
Jacobi (1804-1851), are orthogonal on $[-1, 1]$ with respect to the weight
function $w(x) = (1 - x)^{\alpha}(1 + x)^{\beta}$, $\alpha > -1$, $\beta > -1$. The restrictions on $\alpha$
and $\beta$ are needed to guarantee integrability of $w(x)$ over the interval $[-1, 1]$.
These polynomials, which we denote by $p_n^{(\alpha, \beta)}(x)$, can be derived by applying
the Rodrigues formula:

$$p_n^{(\alpha, \beta)}(x) = \frac{(-1)^n}{2^n n!}(1 - x)^{-\alpha}(1 + x)^{-\beta}\frac{d^n\left[(1 - x)^{\alpha + n}(1 + x)^{\beta + n}\right]}{dx^n}, \qquad n = 0, 1, 2, \ldots. \tag{10.10}$$

This formula reduces to the one for Legendre polynomials when $\alpha = \beta = 0$.
Thus, Legendre polynomials represent a special class of Jacobi polynomials.

Applying the so-called Leibniz formula (see Exercise 4.2 in Chapter 4)
concerning the $n$th derivative of a product of two functions, namely, $f_n(x) =
(1 - x)^{\alpha + n}$ and $g_n(x) = (1 + x)^{\beta + n}$ in (10.10), we obtain

$$\frac{d^n\left[(1 - x)^{\alpha + n}(1 + x)^{\beta + n}\right]}{dx^n} = \sum_{i=0}^{n}\binom{n}{i} f_n^{(i)}(x)\,g_n^{(n-i)}(x), \tag{10.11}$$

where for $i = 0, 1, \ldots, n$, $f_n^{(i)}(x)$ is a constant multiple of $(1 - x)^{\alpha + n - i} =
(1 - x)^{\alpha}(1 - x)^{n - i}$, and $g_n^{(n-i)}(x)$ is a constant multiple of $(1 + x)^{\beta + i} =
(1 + x)^{\beta}(1 + x)^{i}$. Thus, the $n$th derivative in (10.11) has $(1 - x)^{\alpha}(1 + x)^{\beta}$ as a
factor. Using formula (10.10), it can be shown that $p_n^{(\alpha, \beta)}(x)$ is a polynomial
of degree $n$ with the leading coefficient (that is, the coefficient of $x^n$) equal
to $\Gamma(2n + \alpha + \beta + 1)/[2^n n!\,\Gamma(n + \alpha + \beta + 1)]$.


10.4. CHEBYSHEV POLYNOMIALS


These polynomials were named after the Russian mathematician Pafnuty
Lvovich Chebyshev (1821-1894). In this section, two kinds of Chebyshev
polynomials will be studied, called Chebyshev polynomials of the first kind
and of the second kind.


10.4.1. Chebyshev Polynomials of the First Kind

These polynomials are denoted by $T_n(x)$ and defined as

$$T_n(x) = \cos(n \operatorname{Arccos} x), \qquad n = 0, 1, \ldots, \tag{10.12}$$

where $0 \leq \operatorname{Arccos} x \leq \pi$. Note that $T_n(x)$ can be expressed as

$$T_n(x) = x^n + \binom{n}{2}x^{n-2}(x^2 - 1) + \binom{n}{4}x^{n-4}(x^2 - 1)^2 + \cdots, \qquad n = 0, 1, \ldots, \tag{10.13}$$

where $-1 \leq x \leq 1$. Historically, the polynomials defined by (10.13) were
originally called Chebyshev polynomials without any qualifying expression.
Using (10.13), it is easy to obtain the first few of these polynomials:

$$T_0(x) = 1,$$
$$T_1(x) = x,$$
$$T_2(x) = 2x^2 - 1,$$
$$T_3(x) = 4x^3 - 3x,$$
$$T_4(x) = 8x^4 - 8x^2 + 1,$$
$$T_5(x) = 16x^5 - 20x^3 + 5x.$$

The following are some properties of $T_n(x)$:

1. $-1 \leq T_n(x) \leq 1$ for $-1 \leq x \leq 1$.
2. $T_n(-x) = (-1)^n T_n(x)$.
3. $T_n(x)$ has simple zeros at the following $n$ points:

$$\zeta_i = \cos\left[\frac{(2i - 1)\pi}{2n}\right], \qquad i = 1, 2, \ldots, n.$$

We may recall that these zeros, also referred to as Chebyshev points,
were instrumental in minimizing the error of Lagrange interpolation in
Chapter 9 (see Theorem 9.2.3).

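4. The weight function for $T_n(x)$ is $w(x) = (1 - x^2)^{-1/2}$. To show this, we
have that for two nonnegative integers, $m$, $n$,

$$\int_0^{\pi} \cos m\theta \cos n\theta\,d\theta = 0, \qquad m \neq n, \tag{10.14}$$

and

$$\int_0^{\pi} \cos^2 n\theta\,d\theta = \begin{cases} \pi/2, & n \neq 0, \\ \pi, & n = 0. \end{cases} \tag{10.15}$$

Making the change of variables $x = \cos\theta$ in (10.14) and (10.15), we
obtain

$$\int_{-1}^{1} \frac{T_m(x) T_n(x)}{(1 - x^2)^{1/2}}\,dx = 0, \qquad m \neq n,$$

$$\int_{-1}^{1} \frac{T_n^2(x)}{(1 - x^2)^{1/2}}\,dx = \begin{cases} \pi/2, & n \neq 0, \\ \pi, & n = 0. \end{cases}$$

This shows that $\{T_n(x)\}_{n=0}^{\infty}$ forms a sequence of orthogonal polynomials
on $[-1, 1]$ with respect to $w(x) = (1 - x^2)^{-1/2}$.

5. We have

$$T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \qquad n = 1, 2, \ldots. \tag{10.16}$$

To show this recurrence relation, we use the following trigonometric
identities:

$$\cos[(n + 1)\theta] = \cos n\theta \cos\theta - \sin n\theta \sin\theta,$$
$$\cos[(n - 1)\theta] = \cos n\theta \cos\theta + \sin n\theta \sin\theta.$$

Adding these identities, we obtain

$$\cos[(n + 1)\theta] = 2\cos n\theta \cos\theta - \cos[(n - 1)\theta]. \tag{10.17}$$

If we set $x = \cos\theta$ and $\cos n\theta = T_n(x)$ in (10.17), we obtain (10.16).
Recall that $T_0(x) = 1$ and $T_1(x) = x$.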
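
Properties 3 and 5 can be checked numerically. The following is a minimal sketch, assuming a plain floating-point evaluation; the helper name `chebyshev_t` is hypothetical. It evaluates $T_n(x)$ with the recurrence (10.16), compares the result with the defining formula (10.12), and evaluates $T_n$ at the Chebyshev points of property 3.

```python
import math


def chebyshev_t(n, x):
    """Evaluate T_n(x) with the three-term recurrence (10.16)."""
    t_prev, t_curr = 1.0, x  # T_0(x) and T_1(x)
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr


n = 5
points = (-0.9, -0.3, 0.0, 0.4, 1.0)
# Compare the recurrence with the definition cos(n Arccos x).
for x in points:
    print(x, chebyshev_t(n, x), math.cos(n * math.acos(x)))

# Chebyshev points: cos[(2i - 1) pi / (2n)], i = 1, ..., n.
zeros = [math.cos((2 * i - 1) * math.pi / (2 * n)) for i in range(1, n + 1)]
print([chebyshev_t(n, z) for z in zeros])  # values should be numerically zero
```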


10.4.2. Chebyshev Polynomials of the Second Kind 


These polynomials are defined in terms of Chebyshev polynomials of the first
kind as follows: Differentiating $T_n(x) = \cos n\theta$ with respect to $x = \cos\theta$, we
obtain

$$\frac{dT_n(x)}{dx} = -n \sin n\theta\,\frac{d\theta}{dx} = n\,\frac{\sin n\theta}{\sin\theta}.$$

Let $U_n(x)$ be defined as

$$U_n(x) = \frac{1}{n + 1}\frac{dT_{n+1}(x)}{dx} = \frac{\sin[(n + 1)\theta]}{\sin\theta}, \qquad n = 0, 1, \ldots. \tag{10.18}$$

This polynomial, which is of degree $n$, is called a Chebyshev polynomial of
the second kind. Note that

$$U_n(x) = \frac{\sin n\theta \cos\theta + \cos n\theta \sin\theta}{\sin\theta} = x\,U_{n-1}(x) + T_n(x), \qquad n = 1, 2, \ldots, \tag{10.19}$$

where $U_0(x) = 1$. Formula (10.19) provides a recurrence relation for $U_n(x)$.
Another recurrence relation that is free of $T_n(x)$ can be obtained from the
following identity:

$$\sin[(n + 1)\theta] = 2\sin n\theta \cos\theta - \sin[(n - 1)\theta].$$

Hence,

$$U_n(x) = 2x\,U_{n-1}(x) - U_{n-2}(x), \qquad n = 2, 3, \ldots. \tag{10.20}$$

Using the fact that $U_0(x) = 1$, $U_1(x) = 2x$, formula (10.20) can be used to
derive expressions for $U_n(x)$, $n = 2, 3, \ldots$. It is easy to see that the leading
coefficient of $x^n$ in $U_n(x)$ is $2^n$.

We now show that $\{U_n(x)\}_{n=0}^{\infty}$ forms a sequence of orthogonal polynomials
with respect to the weight function $w(x) = (1 - x^2)^{1/2}$ over $[-1, 1]$. From
the formula

$$\int_0^{\pi} \sin[(m + 1)\theta]\sin[(n + 1)\theta]\,d\theta = 0, \qquad m \neq n,$$

we get, after making the change of variables $x = \cos\theta$,

$$\int_{-1}^{1} U_m(x) U_n(x)(1 - x^2)^{1/2}\,dx = 0, \qquad m \neq n.$$

This shows that $w(x) = (1 - x^2)^{1/2}$ is a weight function for the sequence
$\{U_n(x)\}_{n=0}^{\infty}$. Note that $\int_{-1}^{1} U_n^2(x)(1 - x^2)^{1/2}\,dx = \pi/2$.
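
As a numerical illustration of (10.18) and (10.20), the sketch below (with hypothetical helper names `chebyshev_u` and `weighted_norm_sq`) evaluates $U_n(x)$ by the recurrence and approximates the weighted integral of $U_n^2(x)$ after the substitution $x = \cos\theta$, under which the integrand becomes $[U_n(\cos\theta)\sin\theta]^2$ on $[0, \pi]$. A simple midpoint rule is assumed; it is one convenient choice, not the only one.

```python
import math


def chebyshev_u(n, x):
    """Evaluate U_n(x) with the recurrence (10.20), starting from U_0 = 1, U_1 = 2x."""
    u_prev, u_curr = 1.0, 2.0 * x
    if n == 0:
        return u_prev
    for _ in range(n - 1):
        u_prev, u_curr = u_curr, 2.0 * x * u_curr - u_prev
    return u_curr


def weighted_norm_sq(n, steps=2000):
    """Midpoint-rule value of the integral of U_n^2(x) (1 - x^2)^(1/2) over [-1, 1],
    computed in the variable theta, where x = cos(theta)."""
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        theta = (k + 0.5) * h
        total += (chebyshev_u(n, math.cos(theta)) * math.sin(theta)) ** 2
    return total * h


print(chebyshev_u(3, 0.5), math.sin(4 * math.acos(0.5)) / math.sin(math.acos(0.5)))
print(weighted_norm_sq(3), math.pi / 2)
```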


10.5. HERMITE POLYNOMIALS


Hermite polynomials, denoted by $\{H_n(x)\}_{n=0}^{\infty}$, were named after the French
mathematician Charles Hermite (1822-1901). They are defined by the
Rodrigues formula,

$$H_n(x) = (-1)^n e^{x^2/2}\frac{d^n(e^{-x^2/2})}{dx^n}, \qquad n = 0, 1, 2, \ldots. \tag{10.21}$$

From (10.21), we have

$$\frac{d^n(e^{-x^2/2})}{dx^n} = (-1)^n e^{-x^2/2} H_n(x). \tag{10.22}$$

By differentiating the two sides in (10.22), we obtain

$$\frac{d^{n+1}(e^{-x^2/2})}{dx^{n+1}} = (-1)^n\left[-x e^{-x^2/2} H_n(x) + e^{-x^2/2}\frac{dH_n(x)}{dx}\right]. \tag{10.23}$$

But, from (10.21),

$$\frac{d^{n+1}(e^{-x^2/2})}{dx^{n+1}} = (-1)^{n+1} e^{-x^2/2} H_{n+1}(x). \tag{10.24}$$

From (10.23) and (10.24) we then have

$$H_{n+1}(x) = x H_n(x) - \frac{dH_n(x)}{dx}, \qquad n = 0, 1, 2, \ldots,$$

which defines a recurrence relation for the sequence $\{H_n(x)\}_{n=0}^{\infty}$. Since
$H_0(x) = 1$, it follows by induction, using this relation, that $H_n(x)$ is a
polynomial of degree $n$. Its leading coefficient is equal to one.

Note that if $w(x) = e^{-x^2/2}$, then

$$w(x - t) = \exp\left(-\frac{x^2}{2} + tx - \frac{t^2}{2}\right) = w(x)\exp\left(tx - \frac{t^2}{2}\right).$$

Applying Taylor's expansion to $w(x - t)$, we obtain

$$w(x - t) = \sum_{n=0}^{\infty}\frac{(-1)^n}{n!}t^n\frac{d^n[w(x)]}{dx^n} = \sum_{n=0}^{\infty}\frac{t^n}{n!}H_n(x) w(x).$$

Consequently, $H_n(x)$ is the coefficient of $t^n/n!$ in the expansion of $\exp(tx -
t^2/2)$. It follows that

$$H_n(x) = x^n - \frac{n^{[2]}}{2 \cdot 1!}x^{n-2} + \frac{n^{[4]}}{2^2 \cdot 2!}x^{n-4} - \frac{n^{[6]}}{2^3 \cdot 3!}x^{n-6} + \cdots,$$

where $n^{[r]} = n(n - 1)(n - 2) \cdots (n - r + 1)$. This particular representation of
$H_n(x)$ is given in Kendall and Stuart (1977, page 167). For example, the first
seven Hermite polynomials are

$$H_0(x) = 1,$$
$$H_1(x) = x,$$
$$H_2(x) = x^2 - 1,$$
$$H_3(x) = x^3 - 3x,$$
$$H_4(x) = x^4 - 6x^2 + 3,$$
$$H_5(x) = x^5 - 10x^3 + 15x,$$
$$H_6(x) = x^6 - 15x^4 + 45x^2 - 15.$$


Another recurrence relation that does not use the derivative of $H_n(x)$ is
given by

$$H_{n+1}(x) = x H_n(x) - n H_{n-1}(x), \qquad n = 1, 2, \ldots, \tag{10.25}$$

with $H_0(x) = 1$ and $H_1(x) = x$. To show this, we use (10.21) in (10.25):

$$(-1)^{n+1} e^{x^2/2}\frac{d^{n+1}(e^{-x^2/2})}{dx^{n+1}} = x(-1)^n e^{x^2/2}\frac{d^n(e^{-x^2/2})}{dx^n} - n(-1)^{n-1} e^{x^2/2}\frac{d^{n-1}(e^{-x^2/2})}{dx^{n-1}},$$

or equivalently,

$$-\frac{d^{n+1}(e^{-x^2/2})}{dx^{n+1}} = x\frac{d^n(e^{-x^2/2})}{dx^n} + n\frac{d^{n-1}(e^{-x^2/2})}{dx^{n-1}}.$$

This is true given the fact that

$$\frac{d(e^{-x^2/2})}{dx} = -x e^{-x^2/2}.$$

Hence,

$$-\frac{d^{n+1}(e^{-x^2/2})}{dx^{n+1}} = \frac{d^n(x e^{-x^2/2})}{dx^n} = n\frac{d^{n-1}(e^{-x^2/2})}{dx^{n-1}} + x\frac{d^n(e^{-x^2/2})}{dx^n}, \tag{10.26}$$

which results from applying Leibniz's formula to the $n$th derivative of the
product $x e^{-x^2/2}$ in (10.26).
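
The two recurrence relations for Hermite polynomials can be compared directly. The following is a minimal sketch, assuming integer coefficient lists with the constant term first; the function names are illustrative. One routine uses $H_{n+1}(x) = xH_n(x) - dH_n(x)/dx$ and the other uses (10.25), and the two outputs are expected to coincide.

```python
def hermite_by_derivative(n_max):
    """H_0, ..., H_{n_max} from H_{n+1}(x) = x H_n(x) - dH_n(x)/dx."""
    polys = [[1]]
    for _ in range(n_max):
        h = polys[-1]
        x_h = [0] + h  # coefficients of x * H_n(x)
        dh = [k * h[k] for k in range(1, len(h))]  # coefficients of dH_n/dx
        dh += [0] * (len(x_h) - len(dh))  # pad to a common length
        polys.append([a - b for a, b in zip(x_h, dh)])
    return polys


def hermite_three_term(n_max):
    """H_0, ..., H_{n_max} from H_{n+1}(x) = x H_n(x) - n H_{n-1}(x), i.e. (10.25)."""
    polys = [[1], [0, 1]]
    for n in range(1, n_max):
        x_h = [0] + polys[n]
        prev = polys[n - 1] + [0] * (len(x_h) - len(polys[n - 1]))
        polys.append([a - n * b for a, b in zip(x_h, prev)])
    return polys[: n_max + 1]


print(hermite_by_derivative(6)[6])  # coefficients of 1, x, ..., x^6 in H_6(x)
print(hermite_by_derivative(6) == hermite_three_term(6))
```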

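We now show that $\{H_n(x)\}_{n=0}^{\infty}$ forms a sequence of orthogonal polynomials
with respect to the weight function $w(x) = e^{-x^2/2}$ over $(-\infty, \infty)$. For this
purpose, let $m$, $n$ be nonnegative integers, and let $c$ be defined as

$$c = \int_{-\infty}^{\infty} e^{-x^2/2} H_m(x) H_n(x)\,dx. \tag{10.27}$$

Then, from (10.21) and (10.27), we have

$$c = (-1)^n\int_{-\infty}^{\infty} H_m(x)\frac{d^n(e^{-x^2/2})}{dx^n}\,dx.$$

Integrating by parts gives

$$c = (-1)^n\left\{\left[H_m(x)\frac{d^{n-1}(e^{-x^2/2})}{dx^{n-1}}\right]_{-\infty}^{\infty} - \int_{-\infty}^{\infty}\frac{dH_m(x)}{dx}\frac{d^{n-1}(e^{-x^2/2})}{dx^{n-1}}\,dx\right\}$$
$$= (-1)^{n+1}\int_{-\infty}^{\infty}\frac{dH_m(x)}{dx}\frac{d^{n-1}(e^{-x^2/2})}{dx^{n-1}}\,dx. \tag{10.28}$$

Formula (10.28) is true because $H_m(x)\,d^{n-1}(e^{-x^2/2})/dx^{n-1}$, which is a
polynomial multiplied by $e^{-x^2/2}$, has a limit equal to zero as $x \to \mp\infty$. By
repeating the process of integration by parts $m - 1$ more times, we obtain for
$n > m$

$$c = (-1)^{m+n}\int_{-\infty}^{\infty}\frac{d^m[H_m(x)]}{dx^m}\frac{d^{n-m}(e^{-x^2/2})}{dx^{n-m}}\,dx. \tag{10.29}$$

Note that since $H_m(x)$ is a polynomial of degree $m$ with a leading coefficient
equal to one, $d^m[H_m(x)]/dx^m$ is a constant equal to $m!$. Furthermore, since
$n > m$, then

$$\int_{-\infty}^{\infty}\frac{d^{n-m}(e^{-x^2/2})}{dx^{n-m}}\,dx = 0.$$

It follows that $c = 0$. We can also arrive at the same conclusion if $n < m$.
Hence,

$$\int_{-\infty}^{\infty} e^{-x^2/2} H_m(x) H_n(x)\,dx = 0, \qquad m \neq n.$$

This shows that $\{H_n(x)\}_{n=0}^{\infty}$ is a sequence of orthogonal polynomials with
respect to $w(x) = e^{-x^2/2}$ over $(-\infty, \infty)$.

Note that if $m = n$ in (10.29), then

$$c = \int_{-\infty}^{\infty}\frac{d^n[H_n(x)]}{dx^n}e^{-x^2/2}\,dx = n!\int_{-\infty}^{\infty} e^{-x^2/2}\,dx = 2\,n!\int_0^{\infty} e^{-x^2/2}\,dx = n!\sqrt{2\pi}.$$

By comparison with (10.27), we conclude that

$$\int_{-\infty}^{\infty} e^{-x^2/2} H_n^2(x)\,dx = n!\sqrt{2\pi}. \tag{10.30}$$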
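
The orthogonality relation and formula (10.30) can be verified exactly for small degrees. The sketch below rests on one standard fact that is not derived in the text: if $Z$ is standard normal, then $E(Z^k) = 0$ for odd $k$ and $E(Z^k) = 1 \cdot 3 \cdots (k - 1)$ for even $k$. Dividing the integrals above by $\sqrt{2\pi}$ turns them into expectations, so $E[H_m(Z)H_n(Z)]$ should equal $n!$ when $m = n$ and zero otherwise. All function names are illustrative.

```python
import math


def hermite(n_max):
    """Integer coefficient lists (constant term first) of H_0, ..., H_{n_max} via (10.25)."""
    polys = [[1], [0, 1]]
    for n in range(1, n_max):
        x_h = [0] + polys[n]
        prev = polys[n - 1] + [0] * (len(x_h) - len(polys[n - 1]))
        polys.append([a - n * b for a, b in zip(x_h, prev)])
    return polys[: n_max + 1]


def gaussian_moment(k):
    """E[Z^k] for Z ~ N(0, 1): zero for odd k, 1 * 3 * ... * (k - 1) for even k."""
    if k % 2:
        return 0
    result = 1
    for j in range(1, k, 2):
        result *= j
    return result


def inner(h_m, h_n):
    """E[H_m(Z) H_n(Z)], i.e. the integral in (10.27) divided by sqrt(2 pi)."""
    return sum(a * b * gaussian_moment(i + j)
               for i, a in enumerate(h_m) for j, b in enumerate(h_n))


H = hermite(6)
for m in range(4):
    print([inner(H[m], H[n]) for n in range(4)])  # diagonal entries are n!
```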


Hermite polynomials can be used to provide the following series expansion
of a function $f(x)$:

$$f(x) = \sum_{n=0}^{\infty} c_n H_n(x), \tag{10.31}$$

where

$$c_n = \frac{1}{n!\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-x^2/2} f(x) H_n(x)\,dx, \qquad n = 0, 1, \ldots. \tag{10.32}$$

Formula (10.32) follows from multiplying both sides of (10.31) by
$e^{-x^2/2} H_n(x)$, integrating over $(-\infty, \infty)$, and noting formula (10.30) and the
orthogonality of the sequence $\{H_n(x)\}_{n=0}^{\infty}$.


10.6. LAGUERRE POLYNOMIALS 


Laguerre polynomials were named after the French mathematician Edmond
Laguerre (1834-1886). They are denoted by $L_n^{(\alpha)}(x)$ and are defined over the
interval $(0, \infty)$, $n = 0, 1, 2, \ldots$, where $\alpha > -1$.

The development of these polynomials is based on an application of the
Leibniz formula to finding the $n$th derivative of the function

$$\phi_n(x) = x^{\alpha + n} e^{-x}.$$

More specifically, for $\alpha > -1$, $L_n^{(\alpha)}(x)$ is defined by a Rodrigues-type
formula, namely,

$$L_n^{(\alpha)}(x) = (-1)^n e^{x} x^{-\alpha}\frac{d^n(x^{\alpha + n} e^{-x})}{dx^n}, \qquad n = 0, 1, 2, \ldots.$$

We shall henceforth use $L_n(x)$ instead of $L_n^{(\alpha)}(x)$.

From this definition, we conclude that $L_n(x)$ is a polynomial of degree $n$
with a leading coefficient equal to one. It can also be shown that Laguerre
polynomials are orthogonal with respect to the weight function $w(x) = e^{-x}x^{\alpha}$
over $(0, \infty)$, that is,

$$\int_0^{\infty} e^{-x} x^{\alpha} L_m(x) L_n(x)\,dx = 0, \qquad m \neq n$$

(see Jackson, 1941, page 185). Furthermore, if $m = n$, then

$$\int_0^{\infty} e^{-x} x^{\alpha} [L_n(x)]^2\,dx = n!\,\Gamma(\alpha + n + 1), \qquad n = 0, 1, \ldots.$$

A function $f(x)$ can be expressed as an infinite series of Laguerre
polynomials of the form

$$f(x) = \sum_{n=0}^{\infty} c_n L_n(x),$$

where

$$c_n = \frac{1}{n!\,\Gamma(\alpha + n + 1)}\int_0^{\infty} e^{-x} x^{\alpha} L_n(x) f(x)\,dx, \qquad n = 0, 1, 2, \ldots.$$


A recurrence relation for $L_n(x)$ is developed as follows: From the definition
of $L_n(x)$, we have

$$(-1)^n x^{\alpha} e^{-x} L_n(x) = \frac{d^n(x^{n + \alpha} e^{-x})}{dx^n}, \qquad n = 0, 1, 2, \ldots. \tag{10.33}$$

Replacing $n$ by $n + 1$ in (10.33) gives

$$(-1)^{n+1} x^{\alpha} e^{-x} L_{n+1}(x) = \frac{d^{n+1}(x^{n + \alpha + 1} e^{-x})}{dx^{n+1}}. \tag{10.34}$$

Now,

$$x^{n + \alpha + 1} e^{-x} = x\,(x^{n + \alpha} e^{-x}). \tag{10.35}$$

Applying the Leibniz formula for the $(n + 1)$st derivative of the product on
the right-hand side of (10.35) and noting that the $n$th derivative of $x$ is zero
for $n = 2, 3, 4, \ldots$, we obtain

$$\frac{d^{n+1}(x^{n + \alpha + 1} e^{-x})}{dx^{n+1}} = x\frac{d^{n+1}(x^{n + \alpha} e^{-x})}{dx^{n+1}} + (n + 1)\frac{d^n(x^{n + \alpha} e^{-x})}{dx^n}$$
$$= x\frac{d}{dx}\left[\frac{d^n(x^{n + \alpha} e^{-x})}{dx^n}\right] + (n + 1)\frac{d^n(x^{n + \alpha} e^{-x})}{dx^n}. \tag{10.36}$$

Using (10.33) and (10.34) in (10.36) gives

$$(-1)^{n+1} x^{\alpha} e^{-x} L_{n+1}(x) = x\frac{d}{dx}\left[(-1)^n x^{\alpha} e^{-x} L_n(x)\right] + (-1)^n (n + 1) x^{\alpha} e^{-x} L_n(x)$$
$$= (-1)^n x^{\alpha} e^{-x}\left[(\alpha + n + 1 - x) L_n(x) + x\frac{dL_n(x)}{dx}\right]. \tag{10.37}$$

Multiplying the two sides of (10.37) by $(-1)^{n+1} e^{x} x^{-\alpha}$, we obtain

$$L_{n+1}(x) = (x - \alpha - n - 1) L_n(x) - x\frac{dL_n(x)}{dx}, \qquad n = 0, 1, 2, \ldots.$$

This recurrence relation gives $L_{n+1}(x)$ in terms of $L_n(x)$ and its derivative.
Note that $L_0(x) = 1$.

Another recurrence relation that does not require using the derivative of
$L_n(x)$ is given by (see Jackson, 1941, page 186)

$$L_{n+1}(x) = (x - \alpha - 2n - 1) L_n(x) - n(\alpha + n) L_{n-1}(x), \qquad n = 1, 2, \ldots.$$
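
The two Laguerre recurrences can be cross-checked in exact arithmetic. The following is a minimal sketch, assuming coefficient lists with the constant term first and an illustrative value $\alpha = 1/2$; the helper names `pad`, `step_derivative`, and `step_three_term` are hypothetical. Starting from $L_0(x) = 1$, both relations should generate the same monic polynomials.

```python
from fractions import Fraction


def pad(p, size):
    """Extend a coefficient list with zeros up to the given length."""
    return p + [Fraction(0)] * (size - len(p))


def step_derivative(L, n, alpha):
    """L_{n+1} = (x - alpha - n - 1) L_n - x dL_n/dx."""
    size = len(L) + 1
    x_L = pad([Fraction(0)] + L, size)
    dL = [k * L[k] for k in range(1, len(L))]
    x_dL = pad([Fraction(0)] + dL, size)
    L_p = pad(L, size)
    return [x_L[k] - (alpha + n + 1) * L_p[k] - x_dL[k] for k in range(size)]


def step_three_term(L, L_prev, n, alpha):
    """L_{n+1} = (x - alpha - 2n - 1) L_n - n (alpha + n) L_{n-1}."""
    size = len(L) + 1
    x_L = pad([Fraction(0)] + L, size)
    L_p, L_pp = pad(L, size), pad(L_prev, size)
    return [x_L[k] - (alpha + 2 * n + 1) * L_p[k] - n * (alpha + n) * L_pp[k]
            for k in range(size)]


alpha = Fraction(1, 2)  # illustrative choice; any alpha > -1 is admissible
by_derivative = [[Fraction(1)]]
for n in range(6):
    by_derivative.append(step_derivative(by_derivative[n], n, alpha))

by_three_term = by_derivative[:2]  # L_0 and L_1 start the three-term recurrence
for n in range(1, 6):
    by_three_term.append(step_three_term(by_three_term[n], by_three_term[n - 1], n, alpha))

print(by_derivative[2])
print(by_derivative == by_three_term)
```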


10.7. LEAST-SQUARES APPROXIMATION WITH ORTHOGONAL 
POLYNOMIALS 


In this section, we consider an approximation problem concerning a
continuous function. Suppose that we have a set of polynomials, $\{p_i(x)\}_{i=0}^{n}$,
orthogonal with respect to a weight function $w(x)$ over the interval $[a, b]$. Let $f(x)$
be a continuous function on $[a, b]$. We wish to approximate $f(x)$ by the sum
$\sum_{i=0}^{n} c_i p_i(x)$, where $c_0, c_1, \ldots, c_n$ are constants to be determined by
minimizing the function

$$\gamma(c_0, c_1, \ldots, c_n) = \int_a^b\left[\sum_{i=0}^{n} c_i p_i(x) - f(x)\right]^2 w(x)\,dx,$$

that is, $\gamma$ is the square of the norm, $\|\sum_{i=0}^{n} c_i p_i - f\|_\omega$. If we differentiate $\gamma$
with respect to $c_0, c_1, \ldots, c_n$ and equate the partial derivatives to zero, we
obtain

$$\frac{\partial\gamma}{\partial c_i} = 2\int_a^b\left[\sum_{j=0}^{n} c_j p_j(x) - f(x)\right] p_i(x) w(x)\,dx = 0, \qquad i = 0, 1, \ldots, n.$$

Hence,

$$\sum_{j=0}^{n}\left[\int_a^b p_i(x) p_j(x) w(x)\,dx\right] c_j = \int_a^b f(x) p_i(x) w(x)\,dx, \qquad i = 0, 1, \ldots, n. \tag{10.38}$$

Equations (10.38) can be written in vector form as

$$\mathbf{S}\mathbf{c} = \mathbf{u}, \tag{10.39}$$

where $\mathbf{c} = (c_0, c_1, \ldots, c_n)'$, $\mathbf{u} = (u_0, u_1, \ldots, u_n)'$ with $u_i = \int_a^b f(x) p_i(x) w(x)\,dx$,
and $\mathbf{S}$ is an $(n + 1) \times (n + 1)$ matrix whose $(i, j)$th element, $s_{ij}$, is given by

$$s_{ij} = \int_a^b p_i(x) p_j(x) w(x)\,dx, \qquad i, j = 0, 1, \ldots, n.$$

Since $p_0(x), p_1(x), \ldots, p_n(x)$ are orthogonal, then $\mathbf{S}$ must be a diagonal
matrix with diagonal elements given by

$$s_{ii} = \int_a^b p_i^2(x) w(x)\,dx = \|p_i\|_\omega^2, \qquad i = 0, 1, \ldots, n.$$

From equation (10.39) we get the solution $\mathbf{c} = \mathbf{S}^{-1}\mathbf{u}$. The $i$th element of $\mathbf{c}$ is
therefore of the form

$$c_i = \frac{(f \cdot p_i)_\omega}{\|p_i\|_\omega^2}, \qquad i = 0, 1, \ldots, n.$$

For such a value of $\mathbf{c}$, $\gamma$ has an absolute minimum, since $\mathbf{S}$ is positive definite.
It follows that the linear combination

$$p_n^*(x) = \sum_{i=0}^{n} c_i p_i(x) = \sum_{i=0}^{n}\frac{(f \cdot p_i)_\omega}{\|p_i\|_\omega^2}\,p_i(x) \tag{10.40}$$

minimizes $\gamma$. We refer to $p_n^*(x)$ as the least-squares polynomial
approximation of $f(x)$ with respect to $p_0(x), p_1(x), \ldots, p_n(x)$.
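
To make formula (10.40) concrete, the sketch below computes least-squares coefficients with Legendre polynomials, for which $w(x) = 1$ on $[-1, 1]$ and $\|p_i\|_\omega^2 = 2/(2i + 1)$. Several things here are assumptions made for illustration only: the test function $f(x) = 1/(1 + x^2)$, the use of composite Simpson's rule for the inner products, and all function names.

```python
import math


def legendre_value(i, x):
    """Evaluate the Legendre polynomial p_i(x) by its three-term recurrence."""
    p_prev, p_curr = 1.0, x
    if i == 0:
        return p_prev
    for n in range(1, i):
        p_prev, p_curr = p_curr, ((2 * n + 1) * x * p_curr - n * p_prev) / (n + 1)
    return p_curr


def simpson(func, a, b, panels=400):
    """Composite Simpson's rule with an even number of panels."""
    h = (b - a) / panels
    total = func(a) + func(b)
    for k in range(1, panels):
        total += (4 if k % 2 else 2) * func(a + k * h)
    return total * h / 3.0


def least_squares_coeffs(f, degree):
    """c_i = (f . p_i) / ||p_i||^2 with ||p_i||^2 = 2 / (2i + 1), as in (10.40)."""
    coeffs = []
    for i in range(degree + 1):
        inner = simpson(lambda x: f(x) * legendre_value(i, x), -1.0, 1.0)
        coeffs.append(inner * (2 * i + 1) / 2.0)
    return coeffs


def squared_error(f, coeffs):
    """Value of the minimized criterion: integral of [p*(x) - f(x)]^2 over [-1, 1]."""
    def residual_sq(x):
        approx = sum(c * legendre_value(i, x) for i, c in enumerate(coeffs))
        return (approx - f(x)) ** 2
    return simpson(residual_sq, -1.0, 1.0)


def f(x):
    return 1.0 / (1.0 + x * x)  # illustrative test function


for degree in (0, 2, 4):
    c = least_squares_coeffs(f, degree)
    print(degree, c, squared_error(f, c))
```

Because the matrix $\mathbf{S}$ is diagonal, raising the degree leaves the earlier coefficients unchanged, so the printed coefficient lists extend one another.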

If $\{p_n(x)\}_{n=0}^{\infty}$ is a sequence of orthogonal polynomials, then $p_n^*(x)$ in
(10.40) represents a partial sum of the infinite series
$\sum_{n=0}^{\infty}[(f \cdot p_n)_\omega/\|p_n\|_\omega^2]\,p_n(x)$. This series may fail to converge point by point to $f(x)$. It
converges, however, to $f(x)$ in the norm $\|\cdot\|_\omega$. This is shown in the next
theorem.


Theorem 10.7.1. If $f: [a, b] \to R$ is continuous, then

$$\int_a^b [f(x) - p_n^*(x)]^2 w(x)\,dx \to 0$$

as $n \to \infty$, where $p_n^*(x)$ is defined by formula (10.40).


Proof. By the Weierstrass theorem (Theorem 9.1.1), there exists a
polynomial $b_n(x)$ of degree $n$ that converges uniformly to $f(x)$ on $[a, b]$, that is,

$$\sup_{a \leq x \leq b}|f(x) - b_n(x)| \to 0 \quad \text{as } n \to \infty.$$

Hence,

$$\int_a^b [f(x) - b_n(x)]^2 w(x)\,dx \to 0$$

as $n \to \infty$, since

$$\int_a^b [f(x) - b_n(x)]^2 w(x)\,dx \leq \sup_{a \leq x \leq b}|f(x) - b_n(x)|^2\int_a^b w(x)\,dx.$$

Furthermore,

$$\int_a^b [f(x) - p_n^*(x)]^2 w(x)\,dx \leq \int_a^b [f(x) - b_n(x)]^2 w(x)\,dx, \tag{10.41}$$

since, by definition, $p_n^*(x)$ is the least-squares polynomial approximation of
$f(x)$. From inequality (10.41) we conclude that $\|f - p_n^*\|_\omega \to 0$ as $n \to \infty$. □


10.8. ORTHOGONAL POLYNOMIALS DEFINED ON A FINITE SET 


In this section, we consider polynomials, $p_0(x), p_1(x), \ldots, p_n(x)$, defined on a
finite set $D = \{x_0, x_1, \ldots, x_n\}$ such that $a \leq x_i \leq b$, $i = 0, 1, \ldots, n$. These
polynomials are orthogonal with respect to a weight function $w^*(x)$ over $D$ if

$$\sum_{i=0}^{n} w^*(x_i)\,p_m(x_i)\,p_\nu(x_i) = 0, \qquad m \neq \nu; \quad m, \nu = 0, 1, \ldots, n.$$

Such polynomials are said to be orthogonal of the discrete type. For example,
the set of discrete Chebyshev polynomials, $\{t_i(j, n)\}_{i=0}^{n}$, which are defined
over the set of integers $j = 0, 1, \ldots, n$, are orthogonal with respect to $w^*(j) =
1$, $j = 0, 1, 2, \ldots, n$, and are given by the following formula (see Abramowitz
and Stegun, 1964, page 791):

$$t_i(j, n) = \sum_{k=0}^{i}(-1)^k\binom{i}{k}\binom{i + k}{k}\frac{j!\,(n - k)!}{(j - k)!\,n!}, \qquad i = 0, 1, \ldots, n; \quad j = 0, 1, \ldots, n, \tag{10.42}$$

where $j!/(j - k)! = j(j - 1) \cdots (j - k + 1)$ is taken to be zero when $k > j$.
For example, for $i = 0, 1, 2$, we have

$$t_0(j, n) = 1, \qquad j = 0, 1, 2, \ldots, n,$$
$$t_1(j, n) = 1 - \frac{2}{n}\,j, \qquad j = 0, 1, \ldots, n,$$
$$t_2(j, n) = 1 - \frac{6j}{n}\left[1 - \frac{j - 1}{n - 1}\right], \qquad j = 0, 1, \ldots, n.$$

A recurrence relation for the discrete Chebyshev polynomials is of the form

$$(i + 1)(n - i)\,t_{i+1}(j, n) = (2i + 1)(n - 2j)\,t_i(j, n) - i(n + i + 1)\,t_{i-1}(j, n), \qquad i = 1, 2, \ldots, n. \tag{10.43}$$
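
Formula (10.42), the discrete orthogonality condition with $w^*(j) = 1$, and the recurrence (10.43) can all be checked in exact arithmetic for a small $n$. The following is a minimal sketch; the names `falling` and `t` are illustrative, and the ratio $j!\,(n - k)!/[(j - k)!\,n!]$ is computed as a quotient of falling factorials so that it is zero whenever $k > j$.

```python
from fractions import Fraction
from math import comb


def falling(j, k):
    """j (j - 1) ... (j - k + 1), which equals j!/(j - k)! and is zero when k > j >= 0."""
    out = 1
    for s in range(k):
        out *= j - s
    return out


def t(i, j, n):
    """Discrete Chebyshev polynomial t_i(j, n) computed from formula (10.42)."""
    total = Fraction(0)
    for k in range(i + 1):
        total += Fraction((-1) ** k * comb(i, k) * comb(i + k, k) * falling(j, k),
                          falling(n, k))
    return total


n = 6
# Gram matrix of t_0, ..., t_n over j = 0, ..., n with unit weights.
gram = [[sum(t(m, j, n) * t(v, j, n) for j in range(n + 1)) for v in range(n + 1)]
        for m in range(n + 1)]
print([gram[m][m] for m in range(n + 1)])  # squared norms on the diagonal
print(t(1, 2, n), t(2, 2, n))
```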


10.9. APPLICATIONS IN STATISTICS 


Orthogonal polynomials play an important role in approximating distribution 
functions of certain random variables. In this section, we consider only 
univariate distributions. 


10.9.1. Applications of Hermite Polynomials 


Hermite polynomials provide a convenient tool for approximating density 
functions and quantiles of distributions using convergent series. They are 
associated with the normal distribution, and it is therefore not surprising that 
they come up in various investigations in statistics and probability theory. 
Here are some examples. 


10.9.1.1. Approximation of Density Functions and Quantiles of Distributions


Let $\phi(x)$ denote the density function of the standard normal distribution,
that is,

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}, \qquad -\infty < x < \infty. \tag{10.44}$$

We recall from Section 10.5 that the sequence $\{H_n(x)\}_{n=0}^{\infty}$ of Hermite
polynomials is orthogonal with respect to $w(x) = e^{-x^2/2}$ and that

$$H_n(x) = x^n - \frac{n^{[2]}}{2 \cdot 1!}x^{n-2} + \frac{n^{[4]}}{2^2 \cdot 2!}x^{n-4} - \frac{n^{[6]}}{2^3 \cdot 3!}x^{n-6} + \cdots, \tag{10.45}$$

where $n^{[r]} = n(n - 1)(n - 2) \cdots (n - r + 1)$. Suppose now that $g(x)$ is a
density function for some continuous distribution. We can represent $g(x)$ as a
series of the form

$$g(x) = \sum_{n=0}^{\infty} b_n H_n(x)\phi(x), \tag{10.46}$$

where, as in formula (10.32),

$$b_n = \frac{1}{n!}\int_{-\infty}^{\infty} g(x) H_n(x)\,dx. \tag{10.47}$$

By substituting $H_n(x)$, as given by formula (10.45), in formula (10.47), we
obtain an expression for $b_n$ in terms of the central moments, $\mu_0, \mu_1, \ldots,
\mu_n, \ldots$, of the distribution whose density function is $g(x)$. These moments
are defined as

$$\mu_n = \int_{-\infty}^{\infty}(x - \mu)^n g(x)\,dx, \qquad n = 0, 1, 2, \ldots,$$

where $\mu$ is the mean of the distribution. Note that $\mu_0 = 1$, $\mu_1 = 0$, and
$\mu_2 = \sigma^2$, the variance of the distribution. In particular, if $\mu = 0$, then

$$b_0 = 1,$$
$$b_1 = 0,$$
$$b_2 = \tfrac{1}{2}(\mu_2 - 1),$$
$$b_3 = \tfrac{1}{6}\mu_3,$$
$$b_4 = \tfrac{1}{24}(\mu_4 - 6\mu_2 + 3),$$
$$b_5 = \tfrac{1}{120}(\mu_5 - 10\mu_3),$$
$$b_6 = \tfrac{1}{720}(\mu_6 - 15\mu_4 + 45\mu_2 - 15).$$

The expression for $g(x)$ in formula (10.46) can then be written as

$$g(x) = \phi(x)\left[1 + \tfrac{1}{2}(\mu_2 - 1)H_2(x) + \tfrac{1}{6}\mu_3 H_3(x) + \tfrac{1}{24}(\mu_4 - 6\mu_2 + 3)H_4(x) + \cdots\right]. \tag{10.48}$$

This expression is known as the Gram-Charlier series of type A.
Thus the Gram-Charlier series provides an expansion of $g(x)$ in terms of
its central moments, the standard normal density, and Hermite polynomials.
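
A truncated version of the series (10.48) is straightforward to evaluate. The following is a minimal sketch for a zero-mean distribution that keeps only the terms through $H_4(x)$; the function names and the moment values used in the demonstration ($\mu_2 = 1$, $\mu_3 = 0.5$, $\mu_4 = 3.5$) are hypothetical and chosen only for illustration.

```python
import math


def phi(x):
    """Standard normal density, formula (10.44)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def gram_charlier(x, mu2, mu3, mu4):
    """Gram-Charlier type A approximation (10.48), truncated after the H_4 term,
    for a distribution with mean zero and central moments mu2, mu3, mu4."""
    h2 = x * x - 1.0
    h3 = x ** 3 - 3.0 * x
    h4 = x ** 4 - 6.0 * x * x + 3.0
    b2 = (mu2 - 1.0) / 2.0
    b3 = mu3 / 6.0
    b4 = (mu4 - 6.0 * mu2 + 3.0) / 24.0
    return phi(x) * (1.0 + b2 * h2 + b3 * h3 + b4 * h4)


# With standard normal moments every correction term vanishes.
print(gram_charlier(0.7, 1.0, 0.0, 3.0), phi(0.7))

# Hypothetical skewed case.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, gram_charlier(x, 1.0, 0.5, 3.5))
```

A truncated series of this kind integrates to one, since each $H_n(x)\phi(x)$ with $n \geq 1$ integrates to zero, but it is not guaranteed to be nonnegative for every choice of moments.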


Using formulas (10.21) and (10.46), we note that $g(x)$ can be expressed as a
series of derivatives of $\phi(x)$ of the form

$$g(x) = \sum_{n=0}^{\infty} c_n\left(\frac{d}{dx}\right)^n\phi(x), \tag{10.49}$$

where

$$c_n = \frac{(-1)^n}{n!}\int_{-\infty}^{\infty} g(x) H_n(x)\,dx, \qquad n = 0, 1, \ldots.$$

Cramér (1946, page 223) gave conditions for the convergence of the series
on the right-hand side of formula (10.49), namely, if $g(x)$ is continuous and
of bounded variation on $(-\infty, \infty)$, and if the integral $\int_{-\infty}^{\infty} g(x)\exp(x^2/4)\,dx$ is
convergent, then the series in formula (10.49) will converge for every $x$ to
$g(x)$.
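We can utilize the Gram-Charlier series to find the upper $\alpha$-quantile, $x_\alpha$, of
the distribution with the density function $g(x)$. This point is defined as

$$\int_{-\infty}^{x_\alpha} g(x)\,dx = 1 - \alpha.$$

From (10.46) we have that

$$g(x) = \phi(x) + \sum_{n=2}^{\infty} b_n H_n(x)\phi(x).$$

Then

$$\int_{-\infty}^{x_\alpha} g(x)\,dx = \int_{-\infty}^{x_\alpha}\phi(x)\,dx + \sum_{n=2}^{\infty} b_n\int_{-\infty}^{x_\alpha} H_n(x)\phi(x)\,dx. \tag{10.50}$$

However,

$$\int_{-\infty}^{x_\alpha} H_n(x)\phi(x)\,dx = -H_{n-1}(x_\alpha)\phi(x_\alpha).$$

To prove this equality we note that by formula (10.21)

$$\int_{-\infty}^{x_\alpha} H_n(x)\phi(x)\,dx = (-1)^n\int_{-\infty}^{x_\alpha}\left(\frac{d}{dx}\right)^n\phi(x)\,dx = (-1)^n\left(\frac{d}{dx}\right)^{n-1}\phi(x_\alpha),$$

where $(d/dx)^{n-1}\phi(x_\alpha)$ denotes the value of the $(n - 1)$st derivative of $\phi(x)$
at $x_\alpha$. By applying formula (10.21) again we obtain

$$\int_{-\infty}^{x_\alpha} H_n(x)\phi(x)\,dx = (-1)^n(-1)^{n-1} H_{n-1}(x_\alpha)\phi(x_\alpha) = -H_{n-1}(x_\alpha)\phi(x_\alpha).$$

By making the substitution in formula (10.50), we get

$$\int_{-\infty}^{x_\alpha} g(x)\,dx = \int_{-\infty}^{x_\alpha}\phi(x)\,dx - \sum_{n=2}^{\infty} b_n H_{n-1}(x_\alpha)\phi(x_\alpha). \tag{10.51}$$

Now, suppose that $z_\alpha$ is the upper $\alpha$-quantile of the standard normal
distribution. Then

$$\int_{-\infty}^{x_\alpha} g(x)\,dx = 1 - \alpha = \int_{-\infty}^{z_\alpha}\phi(x)\,dx.$$

Using the expansion (10.51), we obtain

$$\int_{-\infty}^{x_\alpha}\phi(x)\,dx - \sum_{n=2}^{\infty} b_n H_{n-1}(x_\alpha)\phi(x_\alpha) = \int_{-\infty}^{z_\alpha}\phi(x)\,dx. \tag{10.52}$$

If we expand the right-hand side of equation (10.52) using Taylor's series in a
neighborhood of $x_\alpha$, we get

$$\int_{-\infty}^{z_\alpha}\phi(x)\,dx = \int_{-\infty}^{x_\alpha}\phi(x)\,dx + \sum_{j=1}^{\infty}\frac{(z_\alpha - x_\alpha)^j}{j!}\left(\frac{d}{dx}\right)^{j-1}\phi(x_\alpha)$$
$$= \int_{-\infty}^{x_\alpha}\phi(x)\,dx + \sum_{j=1}^{\infty}\frac{(z_\alpha - x_\alpha)^j}{j!}(-1)^{j-1} H_{j-1}(x_\alpha)\phi(x_\alpha), \quad \text{using formula (10.21)}$$
$$= \int_{-\infty}^{x_\alpha}\phi(x)\,dx - \sum_{j=1}^{\infty}\frac{(x_\alpha - z_\alpha)^j}{j!} H_{j-1}(x_\alpha)\phi(x_\alpha). \tag{10.53}$$

From formulas (10.52) and (10.53) we conclude that

$$\sum_{n=2}^{\infty} b_n H_{n-1}(x_\alpha)\phi(x_\alpha) = \sum_{j=1}^{\infty}\frac{(x_\alpha - z_\alpha)^j}{j!} H_{j-1}(x_\alpha)\phi(x_\alpha).$$

By dividing both sides by $\phi(x_\alpha)$ we obtain

$$\sum_{n=2}^{\infty} b_n H_{n-1}(x_\alpha) = \sum_{j=1}^{\infty}\frac{(x_\alpha - z_\alpha)^j}{j!} H_{j-1}(x_\alpha). \tag{10.54}$$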


This provides a relationship between $x_\alpha$, the $\alpha$-quantile of the distribution
with the density function $g(x)$, and $z_\alpha$, the corresponding quantile for the
standard normal. Since the $b_n$'s are functions of the moments associated with
$g(x)$, then it is possible to use (10.54) to express $x_\alpha$ in terms of $z_\alpha$ and the
moments of $g(x)$. This was carried out by Cornish and Fisher (1937). They
provided an expansion for $x_\alpha$ in terms of $z_\alpha$ and the cumulants (instead
of the moments) associated with $g(x)$. (See Section 5.6.2 for a definition of
cumulants. Note that there is a one-to-one correspondence between
moments and cumulants.) Such an expansion became known as the
Cornish-Fisher expansion. It is reported in Johnson and Kotz (1970, page 34)
(see Exercise 10.11). See also Kendall and Stuart (1977, pages 175-178).


10.9.1.2. Approximation of a Normal Integral 


A convergent series representing the integral

$$\frac{1}{\sqrt{2\pi}}\int_0^x e^{-t^2/2}\,dt$$


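was derived by Kerridge and Cook (1976). Their method is based on the fact
that

$$\int_0^x f(t)\,dt = 2\sum_{n=0}^{\infty}\frac{(x/2)^{2n+1}}{(2n + 1)!}\,f^{(2n)}\left(\frac{x}{2}\right) \tag{10.55}$$

for any function $f(t)$ with a suitably convergent Taylor's expansion in a
neighborhood of $x/2$, namely,

$$f(t) = \sum_{n=0}^{\infty}\frac{1}{n!}\left(t - \frac{x}{2}\right)^n f^{(n)}\left(\frac{x}{2}\right). \tag{10.56}$$

Formula (10.55) results from integrating (10.56) with respect to $t$ from $0$ to $x$
and noting that the terms with odd $n$ integrate to zero, so that only the
derivatives of even order remain. Taking $f(t) = e^{-t^2/2}$, we obtain

$$\int_0^x e^{-t^2/2}\,dt = 2\sum_{n=0}^{\infty}\frac{(x/2)^{2n+1}}{(2n + 1)!}\left.\frac{d^{2n}(e^{-t^2/2})}{dt^{2n}}\right|_{t = x/2}. \tag{10.57}$$

Using the Rodrigues formula for Hermite polynomials [formula (10.21)], we
get

$$\frac{d^n(e^{-x^2/2})}{dx^n} = (-1)^n e^{-x^2/2} H_n(x), \qquad n = 0, 1, \ldots.$$

By making the substitution in (10.57), we find

$$\int_0^x e^{-t^2/2}\,dt = 2\sum_{n=0}^{\infty}\frac{(x/2)^{2n+1}}{(2n + 1)!}\,e^{-x^2/8} H_{2n}\left(\frac{x}{2}\right) = 2e^{-x^2/8}\sum_{n=0}^{\infty}\frac{(x/2)^{2n+1}}{(2n + 1)!}\,H_{2n}\left(\frac{x}{2}\right). \tag{10.58}$$

This expression can be simplified by letting $\theta_n(x) = x^n H_n(x)/n!$ in (10.58),
which gives

$$\int_0^x e^{-t^2/2}\,dt = x e^{-x^2/8}\sum_{n=0}^{\infty}\frac{\theta_{2n}(x/2)}{2n + 1}.$$

Hence,

$$\frac{1}{\sqrt{2\pi}}\int_0^x e^{-t^2/2}\,dt = \frac{x e^{-x^2/8}}{\sqrt{2\pi}}\sum_{n=0}^{\infty}\frac{\theta_{2n}(x/2)}{2n + 1}. \tag{10.59}$$

Note that on the basis of formula (10.25), the recurrence relation for $\theta_n(x)$ is
given by

$$\theta_{n+1}(x) = \frac{x^2(\theta_n - \theta_{n-1})}{n + 1}, \qquad n = 1, 2, \ldots.$$

The $\theta_n(x)$'s are easier to handle numerically than the Hermite polynomials,
as they remain relatively small, even for large $n$. Kerridge and Cook (1976)
report that the series in (10.59) is accurate over a wide range of $x$. Divgi
(1979), however, states that the convergence of the series becomes slower as
$x$ increases.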
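
The series built on $\theta_n(x) = x^n H_n(x)/n!$ and its recurrence can be evaluated in a few lines. The following is a minimal sketch, not the authors' algorithm as published; the function name, the fixed number of terms, and the comparison against `math.erf` are choices made here for illustration. It uses the starting values $\theta_0(y) = 1$ and $\theta_1(y) = y^2$, which follow from $H_0 = 1$ and $H_1(y) = y$, and the closed form $\int_0^x e^{-t^2/2}\,dt = \sqrt{\pi/2}\,\operatorname{erf}(x/\sqrt{2})$ as a reference.

```python
import math


def normal_integral_series(x, terms=30):
    """Approximate the integral of exp(-t^2/2) over [0, x] by
    x * exp(-x^2/8) * sum_{n < terms} theta_{2n}(x/2) / (2n + 1)."""
    y = x / 2.0
    thetas = [1.0, y * y]  # theta_0(y) and theta_1(y)
    for n in range(1, 2 * terms):
        # theta_{n+1} = y^2 (theta_n - theta_{n-1}) / (n + 1)
        thetas.append(y * y * (thetas[n] - thetas[n - 1]) / (n + 1))
    series = sum(thetas[2 * n] / (2 * n + 1) for n in range(terms))
    return x * math.exp(-x * x / 8.0) * series


def normal_integral_exact(x):
    """Reference value through the error function."""
    return math.sqrt(math.pi / 2.0) * math.erf(x / math.sqrt(2.0))


for x in (0.5, 1.0, 2.0):
    print(x, normal_integral_series(x), normal_integral_exact(x))
```

Dividing either value by $\sqrt{2\pi}$ gives the normalized integral that appears in (10.59).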


10.9.1.3. Estimation of Unknown Densities 


Let $X_1, X_2, \ldots, X_n$ represent a sequence of independent random variables
with a common, but unknown, density function $f(x)$ assumed to be square
integrable. From (10.31) we have the representation

$$f(x) = \sum_{j=0}^{\infty} c_j H_j(x),$$

or equivalently,

$$f(x) = \sum_{j=0}^{\infty} a_j h_j(x), \tag{10.60}$$

where $h_j(x)$ is the so-called normalized Hermite polynomial of degree $j$,
namely,

$$h_j(x) = \frac{1}{(\sqrt{2\pi}\,j!)^{1/2}}\,e^{-x^2/4} H_j(x), \qquad j = 0, 1, \ldots,$$

and $a_j = \int_{-\infty}^{\infty} f(x) h_j(x)\,dx$, since $\int_{-\infty}^{\infty} h_j^2(x)\,dx = 1$ by virtue of (10.30).
Schwartz (1967) considered an estimate of $f(x)$ of the form

$$\hat{f}_n(x) = \sum_{j=0}^{q(n)}\hat{a}_{jn} h_j(x),$$

where

$$\hat{a}_{jn} = \frac{1}{n}\sum_{k=1}^{n} h_j(X_k),$$

and $q(n)$ is a suitably chosen integer dependent on $n$ such that $q(n) = o(n)$
as $n \to \infty$. Under these conditions, Schwartz (1967, Theorem 1) showed that
$\hat{f}_n(x)$ is a consistent estimator of $f(x)$ in the mean integrated squared error
sense, that is,

$$\lim_{n \to \infty} E\int [\hat{f}_n(x) - f(x)]^2\,dx = 0.$$

Under additional conditions on $f(x)$, $\hat{f}_n(x)$ is also consistent in the mean
squared error sense, that is,

$$\lim_{n \to \infty} E[f(x) - \hat{f}_n(x)]^2 = 0$$

uniformly in $x$.
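
An estimator of this type can be sketched in a few lines. The code below is a rough illustration under stated assumptions, not Schwartz's procedure as published: the normalized functions are taken to be $h_j(x) = e^{-x^2/4}H_j(x)/(\sqrt{2\pi}\,j!)^{1/2}$ with $H_j$ evaluated by the recurrence (10.25), the cutoff $q$ is fixed by hand rather than tied to $n$, and the simulated standard normal sample, the seed, and all names are hypothetical.

```python
import math
import random


def hermite_function(j, x):
    """h_j(x) = exp(-x^2/4) H_j(x) / sqrt(sqrt(2 pi) j!), with H_j from (10.25)."""
    h_prev, h_curr = 1.0, x  # H_0(x) and H_1(x)
    if j == 0:
        h = h_prev
    else:
        for n in range(1, j):
            h_prev, h_curr = h_curr, x * h_curr - n * h_prev
        h = h_curr
    norm = math.sqrt(math.sqrt(2.0 * math.pi) * math.factorial(j))
    return math.exp(-x * x / 4.0) * h / norm


def density_estimate(x, data, q):
    """Sum over j = 0, ..., q of a_hat_j h_j(x), where a_hat_j is the
    sample mean of h_j over the data."""
    n = len(data)
    total = 0.0
    for j in range(q + 1):
        a_hat = sum(hermite_function(j, xk) for xk in data) / n
        total += a_hat * hermite_function(j, x)
    return total


random.seed(12345)  # fixed seed so the illustration is reproducible
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
for x in (-2.0, 0.0, 2.0):
    true_value = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    print(x, density_estimate(x, sample, 4), true_value)
```

With a small cutoff the estimate carries some truncation bias even for large samples, which is why $q(n)$ is allowed to grow with $n$ in the result quoted above.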


10.9.2. Applications of Jacobi and Laguerre Polynomials 


Dasgupta (1968) presented an approximation to the distribution function of
$X = \tfrac{1}{2}(r + 1)$, where $r$ is the sample correlation coefficient, in terms of a beta
density and Jacobi polynomials. Similar methods were used by Durbin and
Watson (1951) in deriving an approximation of the distribution of a statistic 
used for testing serial correlation in least-squares regression. 

Quadratic forms in random variables, which can often be regarded as 
having joint multivariate normal distributions, play an important role in 
analysis of variance and in estimation of variance components for a random 
or a mixed model. Approximation of the distributions of such quadratic forms 
can be carried out using Laguerre polynomials (see, for example, Gurland, 
1953, and Johnson and Kotz, 1968). Tiku (1964a) developed Laguerre series 
expansions of the distribution functions of the nonnormal variance ratios 
used for testing the homogeneity of treatment means in the case of one-way 
classification for analysis of variance with nonidentical group-to-group error 
distributions that are not assumed to be normal. Tiku (1964b) also used 
Laguerre polynomials to obtain an approximation to the first negative mo- 
ment of a Poisson random variable, that is, the value of E(X~!), where X 
has the Poisson distribution. 

More recently, Schöne and Schmid (2000) made use of Laguerre polyno- 
mials to develop a series representation of the joint density and the joint 
distribution of a quadratic form and a linear form in normal variables. Such a 
representation can be used to calculate, for example, the joint density and 
the joint distribution function of the sample mean and sample variance. Note 
that for autocorrelated variables, the sample mean and sample variance are, 
in general, not independent. 


10.9.3. Calculation of Hypergeometric Probabilities Using Discrete 
Chebyshev Polynomials 


The hypergeometric distribution is a discrete distribution, somewhat related
to the binomial distribution. Suppose, for example, we have a lot of $M$ items,
$r$ of which are defective and $M - r$ of which are nondefective. Suppose that
we choose at random $m$ items without replacement from the lot ($m \leq M$).
Let $X$ be the number of defectives found. Then, the probability that $X = k$
is given by

$$P(X = k) = \frac{\binom{r}{k}\binom{M - r}{m - k}}{\binom{M}{m}}, \tag{10.61}$$

where $\max(0, m - M + r) \leq k \leq \min(m, r)$. A random variable with the
probability mass function (10.61) is said to have a hypergeometric distribution. We
denote such a probability function by $h(k; m, r, M)$.

There are tables for computing the probability value in (10.61) (see, for
example, the tables given by Lieberman and Owen, 1961). There are also
several algorithms for computing this probability. Recently, Alvo and Cabilio
(2000) proposed to represent the hypergeometric distribution in terms of
discrete Chebyshev polynomials, as was seen in Section 10.8. The following is
a summary of this work: Consider the sequence $\{t_n(k, m)\}_{n=0}^{m}$ of discrete
Chebyshev polynomials defined over the set of integers $k = 0, 1, 2, \ldots, m$ [see
formula (10.42)], which is given by

$$t_n(k, m) = \sum_{i=0}^{n}(-1)^i\binom{n}{i}\binom{n + i}{i}\frac{k!\,(m - i)!}{(k - i)!\,m!}, \qquad n = 0, 1, \ldots, m, \quad k = 0, 1, \ldots, m. \tag{10.62}$$

Let $X$ have the hypergeometric distribution as in (10.61). Then according to
Theorem 1 in Alvo and Cabilio (2000),

$$\sum_{k=0}^{m} t_n(k, m)\,h(k; m, r, M) = t_n(r, M) \tag{10.63}$$

for all $n = 0, 1, \ldots, m$ and $r = 0, 1, \ldots, M$. Let $\mathbf{t}_n = [t_n(0, m),
t_n(1, m), \ldots, t_n(m, m)]'$, $n = 0, 1, \ldots, m$, be the base vectors in an $(m + 1)$-dimensional
Euclidean space determined from the Chebyshev polynomials.
Let $g(k)$ be any function defined over the set of integers, $k = 0, 1, \ldots, m$.
Then $g(k)$ can be expressed as

$$g(k) = \sum_{n=0}^{m} g_n t_n(k, m), \qquad k = 0, 1, \ldots, m, \tag{10.64}$$

where $g_n = \mathbf{g}'\mathbf{t}_n/\|\mathbf{t}_n\|^2$, and $\mathbf{g} = [g(0), g(1), \ldots, g(m)]'$. Now, using the result
in (10.63), the expected value of $g(X)$ is given by

$$E[g(X)] = \sum_{k=0}^{m} g(k)\,h(k; m, r, M)$$
$$= \sum_{k=0}^{m}\sum_{n=0}^{m} g_n t_n(k, m)\,h(k; m, r, M)$$
$$= \sum_{n=0}^{m} g_n\sum_{k=0}^{m} t_n(k, m)\,h(k; m, r, M)$$
$$= \sum_{n=0}^{m} g_n t_n(r, M). \tag{10.65}$$

This shows that the expected value of $g(X)$ can be computed from
knowledge of the coefficients $g_n$ and the discrete Chebyshev polynomials up to
order $m$, evaluated at $r$ and $M$.

In particular, if $g(x)$ is an indicator function taking the value one at $x = k$
and the value zero elsewhere, then

$$E[g(X)] = P(X = k) = h(k; m, r, M).$$

Applying the result in (10.65), we then obtain

$$h(k; m, r, M) = \sum_{n=0}^{m}\frac{t_n(k, m)}{\|\mathbf{t}_n\|^2}\,t_n(r, M). \tag{10.66}$$

Because of the recurrence relation (10.43) for discrete Chebyshev
polynomials, calculating the hypergeometric probability using (10.66) can be done
simply on a computer.
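
Formula (10.66) can be compared with the direct binomial-coefficient expression for the hypergeometric probability. The following is a minimal sketch in exact rational arithmetic; it evaluates the discrete Chebyshev polynomials from their explicit sum rather than from the recurrence (10.43), and the function names and the lot parameters ($M = 10$, $r = 4$, $m = 5$) are hypothetical choices for illustration.

```python
from fractions import Fraction
from math import comb


def falling(j, k):
    """j (j - 1) ... (j - k + 1), which is zero when k > j >= 0."""
    out = 1
    for s in range(k):
        out *= j - s
    return out


def t(n, k, m):
    """Discrete Chebyshev polynomial t_n(k, m) from the explicit sum (10.62)."""
    return sum(Fraction((-1) ** i * comb(n, i) * comb(n + i, i) * falling(k, i),
                        falling(m, i)) for i in range(n + 1))


def hypergeometric_direct(k, m, r, M):
    """P(X = k) from binomial coefficients."""
    return Fraction(comb(r, k) * comb(M - r, m - k), comb(M, m))


def hypergeometric_chebyshev(k, m, r, M):
    """P(X = k) from the Chebyshev representation (10.66)."""
    total = Fraction(0)
    for n in range(m + 1):
        norm_sq = sum(t(n, j, m) ** 2 for j in range(m + 1))  # squared length of t_n
        total += t(n, k, m) / norm_sq * t(n, r, M)
    return total


M, r, m = 10, 4, 5  # hypothetical lot size, number of defectives, sample size
for k in range(m + 1):
    print(k, hypergeometric_direct(k, m, r, M), hypergeometric_chebyshev(k, m, r, M))
```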
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EXERCISES 
In Mathematics 


10.1. Show that the sequence {1/ V2 ,cos x, sin x, cos 2.x, sin2x,...,cos nx, 
sin mx,...} is orthonormal with respect to w(x) =1/7 over [— m, m]. 


10.2. Let {p,()_) be a sequence of Legendre polynomials 
(a) Use the Rodrigues formula to show that 
(i) f1,x™p,(x) dx =0 for m =0,1,...,n— 1, 
gut 2n =f 
(ii) [1 x"p,(x) de = ae | a | , n=0,1,2,... 
(b) Deduce from (a) that f1; p,(x)a,_,(x)de=0, where 7z,_,(x) 
denotes an arbitrary polynomial of degree at most equal to n — 1. 
(c) Make use of (a) and (b) to show that f1; p(x) dx =2/(2n + 1}, 
n=0,1,.... 


10.3. Let $\{T_n(x)\}_{n=0}^{\infty}$ be a sequence of Chebyshev polynomials of the first
kind. Let $\zeta_i = \cos[(2i-1)\pi/2n]$, $i = 1, 2, \ldots, n$.
(a) Verify that $\zeta_1, \zeta_2, \ldots, \zeta_n$ are zeros of $T_n(x)$, that is, $T_n(\zeta_i) = 0$,
$i = 1, 2, \ldots, n$.
(b) Show that $\zeta_1, \zeta_2, \ldots, \zeta_n$ are simple zeros of $T_n(x)$.
[Hint: show that $T_n'(\zeta_i) \ne 0$ for $i = 1, 2, \ldots, n$.]


10.4. Let $\{H_n(x)\}_{n=0}^{\infty}$ be a sequence of Hermite polynomials. Show that

(a) $\dfrac{dH_n(x)}{dx} = nH_{n-1}(x)$,

(b) $\dfrac{d^2H_n(x)}{dx^2} - x\dfrac{dH_n(x)}{dx} + nH_n(x) = 0$.


10.5. Let $\{T_n(x)\}_{n=0}^{\infty}$ and $\{U_n(x)\}_{n=0}^{\infty}$ be sequences of Chebyshev polynomials
of the first and second kinds, respectively.
(a) Show that $|U_n(x)| \le n+1$ for $-1 \le x \le 1$.
[Hint: Use the representation (10.18) and mathematical induction on n.]
(b) Show that $|dT_n(x)/dx| \le n^2$ for all $-1 \le x \le 1$, with equality
holding only if $x = \pm 1$ $(n \ge 2)$.
(c) Show that

\[
\int_{-1}^{x} \frac{T_n(t)}{\sqrt{1-t^2}}\,dt = -\frac{U_{n-1}(x)}{n}\sqrt{1-x^2}
\]

for $-1 \le x \le 1$ and $n \ne 0$.


10.6. Show that the Laguerre polynomial $L_n^{(\alpha)}(x)$ of degree n satisfies the
differential equation

\[
x\frac{d^2 L_n^{(\alpha)}(x)}{dx^2} + (\alpha + 1 - x)\frac{dL_n^{(\alpha)}(x)}{dx} + nL_n^{(\alpha)}(x) = 0.
\]


10.7. Consider the function

\[
H(x,t) = \frac{1}{(1-t)^{\alpha+1}}\, e^{-xt/(1-t)}, \qquad \alpha > -1.
\]

Expand H(x,t) as a power series in t and let the coefficient of $t^n$ be
denoted by $(-1)^n g_n(x)/n!$ so that

\[
H(x,t) = \sum_{n=0}^{\infty} \frac{(-1)^n}{n!}\, g_n(x)\, t^n.
\]

Show that $g_n(x)$ is identical to $L_n^{(\alpha)}(x)$ for all n, where $L_n^{(\alpha)}(x)$ is the
Laguerre polynomial of degree n.


10.8. Find the least-squares polynomial approximation of the function
$f(x) = e^x$ over the interval [-1, 1] by using a Legendre polynomial of
degree not exceeding 4.


10.9. A function f(x) defined on $-1 \le x \le 1$ can be represented using an
infinite series of Chebyshev polynomials of the first kind, namely,

\[
f(x) = \frac{c_0}{2} + \sum_{n=1}^{\infty} c_n T_n(x),
\]

where

\[
c_n = \frac{2}{\pi}\int_{-1}^{1} \frac{f(x)\,T_n(x)}{\sqrt{1-x^2}}\,dx, \qquad n = 0, 1, \ldots
\]

This series converges uniformly whenever f(x) is continuous and of
bounded variation on [-1, 1]. Approximate the function $f(x) = e^x$
using the first five terms of the above series.


In Statistics 


10.10. Suppose that from a certain distribution with a mean equal to zero we
have knowledge of the following central moments: $\mu_2 = 1.0$, $\mu_3 = -0.91$,
$\mu_4 = 4.86$, $\mu_5 = -12.57$, $\mu_6 = 53.22$. Obtain an approximation
for the density function of the distribution using Gram-Charlier series
of type A.


10.11. The Cornish-Fisher expansion for $x_\alpha$, the upper $\alpha$-quantile of a
certain distribution, standardized so that its mean and variance are
equal to zero and one, respectively, is of the following form (see
Johnson and Kotz, 1970, page 34):

\[
\begin{aligned}
x_\alpha = z_\alpha &+ \tfrac{1}{6}(z_\alpha^2 - 1)\kappa_3 + \tfrac{1}{24}(z_\alpha^3 - 3z_\alpha)\kappa_4 \\
&- \tfrac{1}{36}(2z_\alpha^3 - 5z_\alpha)\kappa_3^2 + \tfrac{1}{120}(z_\alpha^4 - 6z_\alpha^2 + 3)\kappa_5 \\
&- \tfrac{1}{24}(z_\alpha^4 - 5z_\alpha^2 + 2)\kappa_3\kappa_4 + \tfrac{1}{324}(12z_\alpha^4 - 53z_\alpha^2 + 17)\kappa_3^3 \\
&+ \tfrac{1}{720}(z_\alpha^5 - 10z_\alpha^3 + 15z_\alpha)\kappa_6 \\
&- \tfrac{1}{180}(2z_\alpha^5 - 17z_\alpha^3 + 21z_\alpha)\kappa_3\kappa_5 \\
&- \tfrac{1}{384}(3z_\alpha^5 - 24z_\alpha^3 + 29z_\alpha)\kappa_4^2 \\
&+ \tfrac{1}{288}(14z_\alpha^5 - 103z_\alpha^3 + 107z_\alpha)\kappa_3^2\kappa_4 \\
&- \tfrac{1}{7776}(252z_\alpha^5 - 1688z_\alpha^3 + 1511z_\alpha)\kappa_3^4 + \cdots,
\end{aligned}
\]

where $z_\alpha$ is the upper $\alpha$-quantile of the standard normal distribution,
and $\kappa_r$ is the rth cumulant of the distribution (r = 3, 4, ...). Apply
this expansion to finding the upper 0.05-quantile of the central
chi-squared distribution with n = 5 degrees of freedom.

[Note: The mean and variance of a central chi-squared distribution
with n degrees of freedom are n and 2n, respectively. Its rth cumulant,
denoted by $\kappa_r'$, is

\[
\kappa_r' = n\,(r-1)!\,2^{r-1}, \qquad r = 1, 2, \ldots
\]

Hence, the rth cumulant, $\kappa_r$, of the standardized chi-squared distribution
is $\kappa_r = (2n)^{-r/2}\kappa_r'$ (r = 2, 3, ...).]


10.12. The normal integral $\int_0^x e^{-t^2/2}\,dt$ can be calculated from the series

\[
\int_0^x e^{-t^2/2}\,dt = x - \frac{x^3}{2 \cdot 3} + \frac{x^5}{2^2 \cdot 2! \cdot 5} - \frac{x^7}{2^3 \cdot 3! \cdot 7} + \cdots
\]

(a) Use this series to obtain an approximate value for $\int_0^1 e^{-t^2/2}\,dt$.
(b) Redo part (a) using the series given by formula (10.59), that is,

\[
\int_0^x e^{-t^2/2}\,dt = x\,e^{-x^2/8} \sum_{n=0}^{\infty} \frac{\Theta_{2n}(x/2)}{2n+1}.
\]

(c) Compare the results from (a) and (b) with regard to the number of
terms in each series needed to achieve an answer correct to five
decimal places.


10.13. Show that the expansion given by formula (10.46) is equivalent to
representing the density function g(x) as a series of the form

\[
g(x) = \sum_{n=0}^{\infty} \frac{c_n}{n!}\,\frac{d^n \phi(x)}{dx^n},
\]

where $\phi(x)$ is the standard normal density function, and the $c_n$'s are
constant coefficients.


10.14. Consider the random variable

\[
W = \sum_{i=1}^{n} X_i^2,
\]

where $X_1, X_2, \ldots, X_n$ are independent random variables from a distribution
with the density function

\[
f(x) = \phi(x) - \frac{\lambda_3}{6}\frac{d^3\phi(x)}{dx^3} + \frac{\lambda_4}{24}\frac{d^4\phi(x)}{dx^4} + \frac{\lambda_3^2}{72}\frac{d^6\phi(x)}{dx^6},
\]

where $\phi(x)$ is the standard normal density function and the quantities
$\lambda_3$ and $\lambda_4$ are, respectively, the standard measures of skewness and
kurtosis for the distribution. Obtain the moment generating function
of W, and compare it with the moment generating function of a
chi-squared distribution with n degrees of freedom. (See Example
6.9.8 in Section 6.9.3.)

[Hint: Use Hermite polynomials.]


10.15. A lot of M = 10 articles contains r = 3 defectives and 7 good articles.
Suppose that a sample of m = 4 articles is drawn from the lot without
replacement. Let X denote the number of defective articles in the
sample. Find the expected value of $g(X) = X^3$ using formula (10.65).


CHAPTER 11 


Fourier Series 


Fourier series were first formalized by the French mathematician Jean- 
Baptiste Joseph Fourier (1768-1830) as a result of his work on solving a 
particular partial differential equation known as the heat conduction equa- 
tion. However, the actual introduction of the so-called Fourier theory was 
motivated by a problem in musical acoustics concerning vibrating strings. 
Daniel Bernoulli (1700-1782) is credited as being the first to model the 
motion of a vibrating string as a series of trigonometric functions in 1748, 
twenty years before the birth of Fourier. The actual development of Fourier 
theory took place in 1807 upon Fourier’s return from Egypt, where he was a 
participant in the Egyptian campaign of 1798 under Napoleon Bonaparte. 


11.1. INTRODUCTION 


A series of the form

\[
\frac{a_0}{2} + \sum_{n=1}^{\infty} [a_n \cos nx + b_n \sin nx] \tag{11.1}
\]

is called a trigonometric series. Let f(x) be a function defined and Riemann
integrable on the interval $[-\pi, \pi]$. By definition, the Fourier series associated
with f(x) is a trigonometric series of the form (11.1), where $a_n$ and $b_n$
are given by

\[
a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos nx\,dx, \qquad n = 0, 1, 2, \ldots, \tag{11.2}
\]

\[
b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin nx\,dx, \qquad n = 1, 2, \ldots \tag{11.3}
\]

In this case, we write

\[
f(x) \sim \frac{a_0}{2} + \sum_{n=1}^{\infty} [a_n \cos nx + b_n \sin nx]. \tag{11.4}
\]

The numbers $a_n$ and $b_n$ are called the Fourier coefficients of f(x). The
symbol $\sim$ is used here instead of equality because at this stage, nothing is
known about the convergence of the series in (11.4) for all x in $[-\pi, \pi]$.
Even if the series converges, it may not converge to f(x).

We can also consider the following reverse approach: if the trigonometric
series in (11.4) is uniformly convergent to f(x) on $[-\pi, \pi]$, that is,

\[
f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} [a_n \cos nx + b_n \sin nx], \tag{11.5}
\]

then $a_n$ and $b_n$ are given by formulas (11.2) and (11.3). In this case, the
derivation of $a_n$ and $b_n$ is obtained by multiplying both sides of (11.5) by
cos nx and sin nx, respectively, followed by integration over $[-\pi, \pi]$. More
specifically, to show formula (11.2), we multiply both sides of (11.5) by cos nx.
For $n \ne 0$, we then have

\[
f(x)\cos nx = \frac{a_0}{2}\cos nx + \sum_{k=1}^{\infty} [a_k \cos kx \cos nx + b_k \sin kx \cos nx]. \tag{11.6}
\]

Since the series on the right-hand side converges uniformly, it can be
integrated term by term (this can be easily proved by applying Theorem 6.6.1
to the sequence whose nth term is the nth partial sum of the series). We then
have

\[
\int_{-\pi}^{\pi} f(x)\cos nx\,dx = \frac{a_0}{2}\int_{-\pi}^{\pi} \cos nx\,dx
+ \sum_{k=1}^{\infty}\left[ a_k \int_{-\pi}^{\pi} \cos kx \cos nx\,dx + b_k \int_{-\pi}^{\pi} \sin kx \cos nx\,dx \right]. \tag{11.7}
\]

But

\[
\int_{-\pi}^{\pi} \cos nx\,dx = 0, \tag{11.8}
\]

\[
\int_{-\pi}^{\pi} \sin kx \cos nx\,dx = 0, \tag{11.9}
\]

\[
\int_{-\pi}^{\pi} \cos kx \cos nx\,dx =
\begin{cases} 0, & k \ne n, \\ \pi, & k = n \ge 1. \end{cases} \tag{11.10}
\]

From (11.7)-(11.10) we conclude (11.2). Note that formulas (11.9) and (11.10)
can be shown to be true by recalling the following trigonometric identities:

\[
\sin kx \cos nx = \tfrac{1}{2}\{\sin[(k+n)x] + \sin[(k-n)x]\},
\]

\[
\cos kx \cos nx = \tfrac{1}{2}\{\cos[(k+n)x] + \cos[(k-n)x]\}.
\]

For n = 0 we obtain from (11.7),

\[
\int_{-\pi}^{\pi} f(x)\,dx = a_0\pi.
\]

Formula (11.3) for $b_n$ can be proved similarly. We can therefore state the
following conclusion: a uniformly convergent trigonometric series is the
Fourier series of its sum.


Note. If the series in (11.1) converges or diverges at a point $x_0$, then it
converges or diverges at $x_0 + 2n\pi$ (n = 1, 2, ...) due to the periodic nature of
the sine and cosine functions. Thus, if the series (11.1) represents a function
f(x) on $[-\pi, \pi]$, then the series also represents the so-called periodic
extension of f(x) for all values of x. Geometrically speaking, the periodic
extension of f(x) is obtained by shifting the graph of f(x) on $[-\pi, \pi]$ by
$2\pi, 4\pi, \ldots$ to the right and to the left. For example, for $-3\pi < x < -\pi$, f(x)
is defined by $f(x + 2\pi)$, and for $\pi < x < 3\pi$, f(x) is defined by $f(x - 2\pi)$,
etc. This defines f(x) for all x as a periodic function with period $2\pi$.


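EXAMPLE 11.1.1. Let f(x) be defined on $[-\pi, \pi]$ by the formula

\[
f(x) = \begin{cases} 0, & -\pi \le x < 0, \\[2pt] \dfrac{x}{\pi}, & 0 \le x \le \pi. \end{cases}
\]

Then, from (11.2) we have

\[
\begin{aligned}
a_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos nx\,dx \\
&= \frac{1}{\pi^2}\int_0^{\pi} x\cos nx\,dx \\
&= \frac{1}{\pi^2}\left[\frac{x\sin nx}{n}\right]_0^{\pi} - \frac{1}{\pi^2 n}\int_0^{\pi} \sin nx\,dx \\
&= \frac{1}{\pi^2 n^2}\left[(-1)^n - 1\right].
\end{aligned}
\]

Thus, for n = 1, 2, ...,

\[
a_n = \begin{cases} 0, & n \text{ even}, \\ -2/(\pi^2 n^2), & n \text{ odd}, \end{cases}
\]

and for n = 0, $a_0 = (1/\pi^2)\int_0^{\pi} x\,dx = 1/2$. Also, from (11.3), we get

\[
\begin{aligned}
b_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin nx\,dx \\
&= \frac{1}{\pi^2}\int_0^{\pi} x\sin nx\,dx \\
&= \frac{1}{\pi^2}\left[-\frac{x\cos nx}{n}\right]_0^{\pi} + \frac{1}{\pi^2 n}\int_0^{\pi} \cos nx\,dx \\
&= \frac{(-1)^{n+1}}{\pi n}.
\end{aligned}
\]

The Fourier series of f(x) is then of the form

\[
f(x) \sim \frac{1}{4} - \frac{2}{\pi^2}\sum_{n=1}^{\infty} \frac{1}{(2n-1)^2}\cos[(2n-1)x] - \frac{1}{\pi}\sum_{n=1}^{\infty} \frac{(-1)^n}{n}\sin nx.
\]


EXAMPLE 11.1.2. Let $f(x) = x^2$ $(-\pi \le x \le \pi)$. Then

\[
a_0 = \frac{1}{\pi}\int_{-\pi}^{\pi} x^2\,dx = \frac{2\pi^2}{3},
\]

\[
\begin{aligned}
a_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} x^2\cos nx\,dx \\
&= \left.\frac{x^2\sin nx}{\pi n}\right|_{-\pi}^{\pi} - \frac{2}{\pi n}\int_{-\pi}^{\pi} x\sin nx\,dx \\
&= -\frac{2}{\pi n}\int_{-\pi}^{\pi} x\sin nx\,dx \\
&= \left.\frac{2x\cos nx}{\pi n^2}\right|_{-\pi}^{\pi} - \frac{2}{\pi n^2}\int_{-\pi}^{\pi} \cos nx\,dx \\
&= \frac{4\cos n\pi}{n^2} \\
&= \frac{4(-1)^n}{n^2}, \qquad n = 1, 2, \ldots,
\end{aligned}
\]

\[
b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} x^2\sin nx\,dx = 0,
\]

since $x^2 \sin nx$ is an odd function. Thus, the Fourier expansion of f(x) is

\[
x^2 \sim \frac{\pi^2}{3} - 4\left[\cos x - \frac{\cos 2x}{2^2} + \frac{\cos 3x}{3^2} - \cdots\right].
\]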
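
As a numerical cross-check of Example 11.1.2, the following Python sketch approximates the coefficients in (11.2) and (11.3) and compares them with the closed forms $a_0 = 2\pi^2/3$, $a_n = 4(-1)^n/n^2$, and $b_n = 0$. The function name `fourier_coefficients`, the midpoint quadrature, and the panel count are illustrative assumptions, not part of the text.

```python
import math

def fourier_coefficients(f, n_max, panels=2000):
    """Approximate a_0..a_{n_max} and b_1..b_{n_max} of (11.2)-(11.3)
    on [-pi, pi] with the composite midpoint rule (an assumed quadrature)."""
    h = 2.0 * math.pi / panels
    xs = [-math.pi + (i + 0.5) * h for i in range(panels)]
    fx = [f(x) for x in xs]
    a = [h / math.pi * sum(v * math.cos(n * x) for v, x in zip(fx, xs))
         for n in range(n_max + 1)]
    # b[0] is a placeholder so that b[n] lines up with the index n >= 1.
    b = [0.0] + [h / math.pi * sum(v * math.sin(n * x) for v, x in zip(fx, xs))
                 for n in range(1, n_max + 1)]
    return a, b

# f(x) = x^2 from Example 11.1.2.
a, b = fourier_coefficients(lambda x: x * x, 4)
for n in range(1, 5):
    # columns: n, numerical a_n, closed-form a_n, numerical b_n
    print(n, round(a[n], 4), round(4 * (-1) ** n / n ** 2, 4), round(b[n], 4))
```

The same routine can be applied to the function of Example 11.1.1 by passing `lambda x: x / math.pi if x >= 0 else 0.0`.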


11.2. CONVERGENCE OF FOURIER SERIES 


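In this section, we consider the conditions under which the Fourier series of
f(x) converges to f(x). We shall assume that f(x) is Riemann integrable on
$[-\pi, \pi]$. Hence, $f^2(x)$ is also Riemann integrable on $[-\pi, \pi]$ by Corollary
6.4.1. This condition will be satisfied if, for example, f(x) is continuous on
$[-\pi, \pi]$, or if it has a finite number of discontinuities of the first kind (see
Definition 3.4.2) in this interval.

In order to study the convergence of Fourier series, the following lemmas
are needed:


Lemma 11.2.1. If f(x) is Riemann integrable on $[-\pi, \pi]$, then

\[
\lim_{n\to\infty} \int_{-\pi}^{\pi} f(x)\cos nx\,dx = 0, \tag{11.11}
\]

\[
\lim_{n\to\infty} \int_{-\pi}^{\pi} f(x)\sin nx\,dx = 0. \tag{11.12}
\]

Proof. Let $s_n(x)$ denote the following partial sum of the Fourier series of
f(x):

\[
s_n(x) = \frac{a_0}{2} + \sum_{k=1}^{n} [a_k \cos kx + b_k \sin kx], \tag{11.13}
\]

where $a_k$ (k = 0, 1, ..., n) and $b_k$ (k = 1, 2, ..., n) are given by (11.2) and
(11.3), respectively. Then

\[
\begin{aligned}
\int_{-\pi}^{\pi} f(x)s_n(x)\,dx &= \frac{a_0}{2}\int_{-\pi}^{\pi} f(x)\,dx
+ \sum_{k=1}^{n}\left[ a_k\int_{-\pi}^{\pi} f(x)\cos kx\,dx + b_k\int_{-\pi}^{\pi} f(x)\sin kx\,dx \right] \\
&= \frac{\pi a_0^2}{2} + \pi\sum_{k=1}^{n} (a_k^2 + b_k^2).
\end{aligned}
\]

It can also be verified that

\[
\int_{-\pi}^{\pi} s_n^2(x)\,dx = \frac{\pi a_0^2}{2} + \pi\sum_{k=1}^{n} (a_k^2 + b_k^2).
\]

Consequently,

\[
\begin{aligned}
\int_{-\pi}^{\pi} [f(x) - s_n(x)]^2\,dx &= \int_{-\pi}^{\pi} f^2(x)\,dx - 2\int_{-\pi}^{\pi} f(x)s_n(x)\,dx + \int_{-\pi}^{\pi} s_n^2(x)\,dx \\
&= \int_{-\pi}^{\pi} f^2(x)\,dx - \pi\left[\frac{a_0^2}{2} + \sum_{k=1}^{n} (a_k^2 + b_k^2)\right].
\end{aligned}
\]

Since the left-hand side is nonnegative, it follows that

\[
\frac{a_0^2}{2} + \sum_{k=1}^{n} (a_k^2 + b_k^2) \le \frac{1}{\pi}\int_{-\pi}^{\pi} f^2(x)\,dx. \tag{11.14}
\]

Since the right-hand side of (11.14) exists and is independent of n, the series
$\sum_{k=1}^{\infty} (a_k^2 + b_k^2)$ must be convergent. This follows from applying Theorem 5.1.2
and the fact that the sequence

\[
\frac{a_0^2}{2} + \sum_{k=1}^{n} (a_k^2 + b_k^2), \qquad n = 1, 2, \ldots,
\]

is bounded and monotone increasing. But, the convergence of $\sum_{k=1}^{\infty} (a_k^2 + b_k^2)$
implies that $\lim_{k\to\infty} (a_k^2 + b_k^2) = 0$ (see Result 5.2.1 in Chapter 5). Hence,
$\lim_{k\to\infty} a_k = 0$, $\lim_{k\to\infty} b_k = 0$. □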
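Corollary 11.2.1. If $\phi(x)$ is Riemann integrable on $[-\pi, \pi]$, then

\[
\lim_{n\to\infty} \int_{-\pi}^{\pi} \phi(x)\sin\left[\left(n + \tfrac{1}{2}\right)x\right] dx = 0. \tag{11.15}
\]

Proof. We have that

\[
\phi(x)\sin\left[\left(n + \tfrac{1}{2}\right)x\right] = \left[\phi(x)\cos\frac{x}{2}\right]\sin nx + \left[\phi(x)\sin\frac{x}{2}\right]\cos nx.
\]

Let $\phi_1(x) = \phi(x)\sin(x/2)$, $\phi_2(x) = \phi(x)\cos(x/2)$. Both $\phi_1(x)$ and $\phi_2(x)$ are
Riemann integrable by Corollary 6.4.2. By applying Lemma 11.2.1 to both
$\phi_1(x)$ and $\phi_2(x)$, we obtain

\[
\lim_{n\to\infty} \int_{-\pi}^{\pi} \phi_1(x)\cos nx\,dx = 0, \tag{11.16}
\]

\[
\lim_{n\to\infty} \int_{-\pi}^{\pi} \phi_2(x)\sin nx\,dx = 0. \tag{11.17}
\]

Formula (11.15) follows from the addition of (11.16) and (11.17). □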


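Corollary 11.2.2. If $\phi(x)$ is Riemann integrable on $[-\pi, \pi]$, then

\[
\lim_{n\to\infty} \int_{-\pi}^{0} \phi(x)\sin\left[\left(n + \tfrac{1}{2}\right)x\right] dx = 0,
\]

\[
\lim_{n\to\infty} \int_{0}^{\pi} \phi(x)\sin\left[\left(n + \tfrac{1}{2}\right)x\right] dx = 0.
\]

Proof. Define the functions $h_1(x)$ and $h_2(x)$ as

\[
h_1(x) = \begin{cases} 0, & 0 \le x \le \pi, \\ \phi(x), & -\pi \le x < 0, \end{cases}
\qquad
h_2(x) = \begin{cases} \phi(x), & 0 \le x \le \pi, \\ 0, & -\pi \le x < 0. \end{cases}
\]

Both $h_1(x)$ and $h_2(x)$ are Riemann integrable on $[-\pi, \pi]$. Hence, by
Corollary 11.2.1,

\[
\lim_{n\to\infty} \int_{-\pi}^{0} \phi(x)\sin\left[\left(n + \tfrac{1}{2}\right)x\right] dx
= \lim_{n\to\infty} \int_{-\pi}^{\pi} h_1(x)\sin\left[\left(n + \tfrac{1}{2}\right)x\right] dx = 0,
\]

\[
\lim_{n\to\infty} \int_{0}^{\pi} \phi(x)\sin\left[\left(n + \tfrac{1}{2}\right)x\right] dx
= \lim_{n\to\infty} \int_{-\pi}^{\pi} h_2(x)\sin\left[\left(n + \tfrac{1}{2}\right)x\right] dx = 0. \qquad \square
\]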
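Lemma 11.2.2.

\[
\frac{1}{2} + \sum_{k=1}^{n} \cos ku = \frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}. \tag{11.18}
\]

Proof. Let $G_n(u)$ be defined as

\[
G_n(u) = \frac{1}{2} + \sum_{k=1}^{n} \cos ku.
\]

Multiplying both sides by $2\sin(u/2)$ and using the identity

\[
2\sin\frac{u}{2}\cos ku = \sin\left[\left(k + \tfrac{1}{2}\right)u\right] - \sin\left[\left(k - \tfrac{1}{2}\right)u\right],
\]

we obtain

\[
\begin{aligned}
2\sin\frac{u}{2}\,G_n(u) &= \sin\frac{u}{2} + \sum_{k=1}^{n}\left\{\sin\left[\left(k + \tfrac{1}{2}\right)u\right] - \sin\left[\left(k - \tfrac{1}{2}\right)u\right]\right\} \\
&= \sin\left[\left(n + \tfrac{1}{2}\right)u\right].
\end{aligned}
\]

Hence, if $\sin(u/2) \ne 0$, then

\[
G_n(u) = \frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}. \qquad \square
\]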
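
Identity (11.18) is easy to spot-check numerically. The short Python sketch below evaluates both sides for a few values of n and u with $\sin(u/2) \ne 0$; the function names are hypothetical and chosen only for this illustration.

```python
import math

def cosine_sum(n, u):
    """Left-hand side of (11.18): 1/2 + sum_{k=1}^{n} cos(k u)."""
    return 0.5 + sum(math.cos(k * u) for k in range(1, n + 1))

def closed_form(n, u):
    """Right-hand side of (11.18); requires sin(u/2) != 0."""
    return math.sin((n + 0.5) * u) / (2.0 * math.sin(u / 2.0))

for n in (1, 5, 20):
    for u in (0.3, 1.0, 2.5, -1.7):
        # The two columns should agree up to rounding error.
        print(n, u, cosine_sum(n, u), closed_form(n, u))
```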


Theorem 11.2.1. Let f(x) be Riemann integrable on $[-\pi, \pi]$, and let it
be extended periodically outside this interval. Suppose that at a point x, f(x)
satisfies the following two conditions:

i. Both $f(x^-)$ and $f(x^+)$ exist, where $f(x^-)$ and $f(x^+)$ are the left-sided
and right-sided limits of f(x), and

\[
f(x) = \tfrac{1}{2}\left[f(x^-) + f(x^+)\right]. \tag{11.19}
\]

ii. Both one-sided derivatives,

\[
f'(x^+) = \lim_{h\to 0^+} \frac{f(x+h) - f(x^+)}{h},
\qquad
f'(x^-) = \lim_{h\to 0^-} \frac{f(x+h) - f(x^-)}{h},
\]

exist.

Then the Fourier series of f(x) converges to f(x) at x, that is, as $n \to \infty$,

\[
s_n(x) \to \begin{cases} f(x) & \text{if } x \text{ is a point of continuity}, \\ \tfrac{1}{2}\left[f(x^-) + f(x^+)\right] & \text{if } x \text{ is a point of discontinuity of the first kind}. \end{cases}
\]

Before proving this theorem, it should be noted that if f(x) is continuous
at x, then condition (11.19) is satisfied. If, however, x is a point of discontinuity
of f(x) of the first kind, then f(x) is defined to be equal to the
right-hand side of (11.19). Such a definition of f(x) does not affect the values
of the Fourier coefficients, $a_n$ and $b_n$, in (11.2) and (11.3). Hence, the Fourier
series of f(x) remains unchanged.


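Proof of Theorem 11.2.1. From (11.2), (11.3), and (11.13), we have

\[
\begin{aligned}
s_n(x) &= \frac{1}{2\pi}\int_{-\pi}^{\pi} f(t)\,dt + \sum_{k=1}^{n}\left[\left(\frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\cos kt\,dt\right)\cos kx + \left(\frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\sin kt\,dt\right)\sin kx\right] \\
&= \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\left[\frac{1}{2} + \sum_{k=1}^{n} (\cos kt\cos kx + \sin kt\sin kx)\right] dt \\
&= \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\left[\frac{1}{2} + \sum_{k=1}^{n} \cos k(t - x)\right] dt.
\end{aligned}
\]

Using Lemma 11.2.2, $s_n(x)$ can be written as

\[
s_n(x) = \frac{1}{\pi}\int_{-\pi}^{\pi} f(t)\,\frac{\sin\left[\left(n + \tfrac{1}{2}\right)(t - x)\right]}{2\sin[(t - x)/2]}\,dt. \tag{11.20}
\]

If we make the change of variable t - x = u in (11.20), we obtain

\[
s_n(x) = \frac{1}{\pi}\int_{-\pi - x}^{\pi - x} f(x + u)\,\frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du.
\]

Since both $f(x + u)$ and $\sin[(n + \tfrac{1}{2})u]/[2\sin(u/2)]$ have period $2\pi$ with
respect to u, the integral from $-\pi - x$ to $\pi - x$ has the same value as the
one from $-\pi$ to $\pi$. Thus,

\[
s_n(x) = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x + u)\,\frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du. \tag{11.21}
\]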
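We now need to show that for each x,

\[
\lim_{n\to\infty} s_n(x) = \tfrac{1}{2}\left[f(x^-) + f(x^+)\right].
\]

Formula (11.21) can be written as

\[
\begin{aligned}
s_n(x) &= \frac{1}{\pi}\int_{-\pi}^{0} f(x + u)\,\frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du
+ \frac{1}{\pi}\int_{0}^{\pi} f(x + u)\,\frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du \\
&= \frac{1}{\pi}\int_{-\pi}^{0} \left[f(x + u) - f(x^-)\right]\frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du
+ f(x^-)\,\frac{1}{\pi}\int_{-\pi}^{0} \frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du \\
&\quad + \frac{1}{\pi}\int_{0}^{\pi} \left[f(x + u) - f(x^+)\right]\frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du
+ f(x^+)\,\frac{1}{\pi}\int_{0}^{\pi} \frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du.
\end{aligned} \tag{11.22}
\]

The first integral in (11.22) can be expressed as

\[
\frac{1}{\pi}\int_{-\pi}^{0} \left[f(x + u) - f(x^-)\right]\frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du
= \frac{1}{\pi}\int_{-\pi}^{0} \frac{f(x + u) - f(x^-)}{u}\,\frac{u}{2\sin(u/2)}\,\sin\left[\left(n + \tfrac{1}{2}\right)u\right] du.
\]

We note that the function

\[
\frac{f(x + u) - f(x^-)}{u}\cdot\frac{u}{2\sin(u/2)}
\]

is Riemann integrable on $[-\pi, 0]$, and at u = 0 it has a discontinuity of the
first kind, since

\[
\lim_{u\to 0^-} \frac{f(x + u) - f(x^-)}{u} = f'(x^-), \quad\text{and}\quad
\lim_{u\to 0^-} \frac{u}{2\sin(u/2)} = 1,
\]

that is, both limits are finite. Consequently, by applying Corollary 11.2.2 to
the function $\{[f(x + u) - f(x^-)]/u\}\cdot u/[2\sin(u/2)]$, we get

\[
\lim_{n\to\infty} \frac{1}{\pi}\int_{-\pi}^{0} \left[f(x + u) - f(x^-)\right]\frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du = 0. \tag{11.23}
\]

We can similarly show that the third integral in (11.22) has a limit equal to
zero as $n \to \infty$, that is,

\[
\lim_{n\to\infty} \frac{1}{\pi}\int_{0}^{\pi} \left[f(x + u) - f(x^+)\right]\frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du = 0. \tag{11.24}
\]

Furthermore, from Lemma 11.2.2, we have

\[
\int_{-\pi}^{0} \frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du
= \int_{-\pi}^{0} \left[\frac{1}{2} + \sum_{k=1}^{n} \cos ku\right] du = \frac{\pi}{2}, \tag{11.25}
\]

\[
\int_{0}^{\pi} \frac{\sin\left[\left(n + \tfrac{1}{2}\right)u\right]}{2\sin(u/2)}\,du
= \int_{0}^{\pi} \left[\frac{1}{2} + \sum_{k=1}^{n} \cos ku\right] du = \frac{\pi}{2}. \tag{11.26}
\]

From (11.22)-(11.26), we conclude that

\[
\lim_{n\to\infty} s_n(x) = \tfrac{1}{2}\left[f(x^-) + f(x^+)\right]. \qquad \square
\]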


Definition 11.2.1. A function f(x) is said to be piecewise continuous on
[a, b] if it is continuous on [a, b] except for a finite number of discontinuities
of the first kind in [a, b], and, in addition, both $f(a^+)$ and $f(b^-)$ exist. □


Corollary 11.2.3. Suppose that f(x) is piecewise continuous on $[-\pi, \pi]$,
and that it can be extended periodically outside this interval. In addition, if,
at each interior point of $[-\pi, \pi]$, $f'(x^+)$ and $f'(x^-)$ exist and $f'(-\pi^+)$ and
$f'(\pi^-)$ exist, then at a point x, the Fourier series of f(x) converges to

\[
\tfrac{1}{2}\left[f(x^-) + f(x^+)\right].
\]

Proof. This follows directly from applying Theorem 11.2.1 and the fact
that a piecewise continuous function on [a, b] is Riemann integrable there. □
EXAMPLE 11.2.1. Let f(x) = x $(-\pi < x < \pi)$. The periodic extension of
f(x) (outside the interval $[-\pi, \pi]$) is defined everywhere. In this case,

\[
a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} x\cos nx\,dx = 0, \qquad n = 0, 1, 2, \ldots,
\]

\[
b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} x\sin nx\,dx = \frac{2(-1)^{n+1}}{n}, \qquad n = 1, 2, \ldots
\]

Hence, the Fourier series of f(x) is

\[
x \sim \sum_{n=1}^{\infty} \frac{2(-1)^{n+1}}{n}\sin nx.
\]

This series converges to x at each x in $(-\pi, \pi)$. At $x = -\pi, \pi$ we have
discontinuities of the first kind for the periodic extension of f(x). Hence, at
$x = \pi$, the Fourier series converges to

\[
\tfrac{1}{2}\left[f(\pi^-) + f(\pi^+)\right] = \tfrac{1}{2}\left[\pi + (-\pi)\right] = 0.
\]

Similarly, at $x = -\pi$, the series converges to

\[
\tfrac{1}{2}\left[f(-\pi^-) + f(-\pi^+)\right] = \tfrac{1}{2}\left[\pi + (-\pi)\right] = 0.
\]

For other values of x, the series converges to the value of the periodic
extension of f(x).
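
The behavior described in Example 11.2.1 can be observed numerically. The Python sketch below evaluates partial sums of the series $\sum 2(-1)^{n+1}\sin(nx)/n$ at an interior point and at the jump $x = \pi$; the function name `partial_sum_x` and the truncation levels are illustrative assumptions.

```python
import math

def partial_sum_x(x, n):
    """n-th partial sum of the Fourier series of f(x) = x on (-pi, pi)."""
    return sum(2.0 * (-1) ** (k + 1) / k * math.sin(k * x)
               for k in range(1, n + 1))

# Interior point: the partial sums approach x itself (slowly, roughly like 1/n).
for n in (10, 100, 2000):
    print(n, partial_sum_x(1.0, n))

# At the jump x = pi every term is (numerically) zero, so the sum is
# the midpoint value [pi + (-pi)]/2 = 0 up to rounding error.
print(partial_sum_x(math.pi, 2000))
```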


EXAMPLE 11.2.2. Consider the function $f(x) = x^2$ defined in Example
11.1.2. Its Fourier series is

\[
x^2 \sim \frac{\pi^2}{3} + 4\sum_{n=1}^{\infty} \frac{(-1)^n}{n^2}\cos nx.
\]

The periodic extension of f(x) is continuous everywhere. We can therefore
write

\[
x^2 = \frac{\pi^2}{3} + 4\sum_{n=1}^{\infty} \frac{(-1)^n}{n^2}\cos nx.
\]

In particular, for $x = \pm\pi$, we have

\[
\pi^2 = \frac{\pi^2}{3} + 4\sum_{n=1}^{\infty} \frac{1}{n^2},
\qquad\text{that is,}\qquad
\frac{\pi^2}{6} = \sum_{n=1}^{\infty} \frac{1}{n^2}.
\]
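
A quick numerical illustration of Example 11.2.2, written as a sketch with a hypothetical helper `partial_sum_sq`: the partial sums of the series reproduce $x^2$ on $[-\pi, \pi]$, and at $x = \pi$ they reduce to $\pi^2/3 + 4\sum 1/n^2$, whose limit is $\pi^2$.

```python
import math

def partial_sum_sq(x, n):
    """n-th partial sum of pi^2/3 + 4 * sum (-1)^k cos(kx) / k^2."""
    return math.pi ** 2 / 3.0 + 4.0 * sum((-1) ** k / k ** 2 * math.cos(k * x)
                                          for k in range(1, n + 1))

# The truncation error at x = pi is about 4/n, since the tail is 4 * sum 1/k^2.
print(partial_sum_sq(math.pi, 10000), math.pi ** 2)

# Partial sums of 1/n^2 compared with pi^2/6.
print(sum(1.0 / k ** 2 for k in range(1, 10001)), math.pi ** 2 / 6.0)
```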


11.3. DIFFERENTIATION AND INTEGRATION 
OF FOURIER SERIES 


In Section 11.2 conditions were given under which a function f(x) defined on
$[-\pi, \pi]$ is represented as a Fourier series. In this section, we discuss the
conditions under which the series can be differentiated or integrated term by
term.


Theorem 11.3.1. Let $a_0/2 + \sum_{n=1}^{\infty} [a_n \cos nx + b_n \sin nx]$ be the Fourier
series of f(x). If f(x) is continuous on $[-\pi, \pi]$, $f(-\pi) = f(\pi)$, and $f'(x)$ is
piecewise continuous on $[-\pi, \pi]$, then

a. at each point where $f''(x)$ exists, $f'(x)$ can be represented by the
derivative of the Fourier series of f(x), where differentiation is done
term by term, that is,

\[
f'(x) = \sum_{n=1}^{\infty} [nb_n \cos nx - na_n \sin nx];
\]

b. the Fourier series of f(x) converges uniformly and absolutely to f(x)
on $[-\pi, \pi]$.


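Proof.

a. The Fourier series of f(x) converges to f(x) by Corollary 11.2.3. Thus,

\[
f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} [a_n \cos nx + b_n \sin nx].
\]

The periodic extension of f(x) is continuous, since $f(\pi) = f(-\pi)$.
Furthermore, the derivative, $f'(x)$, of f(x) satisfies the conditions of
Corollary 11.2.3. Hence, the Fourier series of $f'(x)$ converges to $f'(x)$,
that is,

\[
f'(x) = \frac{\alpha_0}{2} + \sum_{n=1}^{\infty} [\alpha_n \cos nx + \beta_n \sin nx], \tag{11.27}
\]

where

\[
\alpha_0 = \frac{1}{\pi}\int_{-\pi}^{\pi} f'(x)\,dx = \frac{1}{\pi}\left[f(\pi) - f(-\pi)\right] = 0,
\]

\[
\begin{aligned}
\alpha_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} f'(x)\cos nx\,dx \\
&= \left.\frac{1}{\pi} f(x)\cos nx\right|_{-\pi}^{\pi} + \frac{n}{\pi}\int_{-\pi}^{\pi} f(x)\sin nx\,dx \\
&= \frac{(-1)^n}{\pi}\left[f(\pi) - f(-\pi)\right] + nb_n \\
&= nb_n,
\end{aligned}
\]

\[
\begin{aligned}
\beta_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} f'(x)\sin nx\,dx \\
&= \left.\frac{1}{\pi} f(x)\sin nx\right|_{-\pi}^{\pi} - \frac{n}{\pi}\int_{-\pi}^{\pi} f(x)\cos nx\,dx \\
&= -na_n.
\end{aligned}
\]

By substituting $\alpha_0$, $\alpha_n$, and $\beta_n$ in (11.27), we obtain

\[
f'(x) = \sum_{n=1}^{\infty} [nb_n \cos nx - na_n \sin nx].
\]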


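b. Consider the Fourier series of $f'(x)$ in (11.27), where $\alpha_0 = 0$, $\alpha_n = nb_n$,
$\beta_n = -na_n$. Then, using inequality (11.14), we obtain

\[
\sum_{n=1}^{\infty} (\alpha_n^2 + \beta_n^2) \le \frac{1}{\pi}\int_{-\pi}^{\pi} [f'(x)]^2\,dx. \tag{11.28}
\]

Inequality (11.28) indicates that $\sum_{n=1}^{\infty} (\alpha_n^2 + \beta_n^2)$ is a convergent series.
Now, let $s_n(x) = a_0/2 + \sum_{k=1}^{n} [a_k \cos kx + b_k \sin kx]$. Then, for
$n \ge m + 1$,

\[
|s_n(x) - s_m(x)| = \left|\sum_{k=m+1}^{n} [a_k \cos kx + b_k \sin kx]\right|
\le \sum_{k=m+1}^{n} |a_k \cos kx + b_k \sin kx|.
\]

Note that

\[
|a_k \cos kx + b_k \sin kx| \le (a_k^2 + b_k^2)^{1/2}. \tag{11.29}
\]

Inequality (11.29) follows from the fact that $a_k \cos kx + b_k \sin kx$ is the
dot product, $\mathbf{u}\cdot\mathbf{v}$, of the vectors $\mathbf{u} = (a_k, b_k)'$, $\mathbf{v} = (\cos kx, \sin kx)'$, and
$|\mathbf{u}\cdot\mathbf{v}| \le \|\mathbf{u}\|_2\|\mathbf{v}\|_2$ (see Theorem 2.1.2). Hence, since $a_k^2 + b_k^2 = (\alpha_k^2 + \beta_k^2)/k^2$,

\[
\begin{aligned}
|s_n(x) - s_m(x)| &\le \sum_{k=m+1}^{n} (a_k^2 + b_k^2)^{1/2} \\
&= \sum_{k=m+1}^{n} \frac{1}{k}(\alpha_k^2 + \beta_k^2)^{1/2} \\
&\le \left[\sum_{k=m+1}^{n} \frac{1}{k^2}\right]^{1/2}\left[\sum_{k=m+1}^{n} (\alpha_k^2 + \beta_k^2)\right]^{1/2},
\end{aligned} \tag{11.30}
\]

where the last step uses the Cauchy-Schwarz inequality, and by (11.28),

\[
\sum_{k=m+1}^{n} (\alpha_k^2 + \beta_k^2) \le \frac{1}{\pi}\int_{-\pi}^{\pi} [f'(x)]^2\,dx.
\]

In addition, $\sum_{k=1}^{\infty} 1/k^2$ is a convergent series. Hence, by the Cauchy
criterion, for a given $\epsilon > 0$, there exists a positive integer N such that
$\sum_{k=m+1}^{n} 1/k^2 < \epsilon^2$ if $n > m > N$. Hence,

\[
|s_n(x) - s_m(x)| \le \sum_{k=m+1}^{n} |a_k \cos kx + b_k \sin kx| \le c\,\epsilon, \qquad \text{if } n > m > N, \tag{11.31}
\]

where $c = \{(1/\pi)\int_{-\pi}^{\pi} [f'(x)]^2\,dx\}^{1/2}$. The double inequality (11.31)
shows that the Fourier series of f(x) converges absolutely and uniformly
to f(x) on $[-\pi, \pi]$ by the Cauchy criterion. □


Note that from (11.30) we can also conclude that $\sum_{k=1}^{\infty} (a_k^2 + b_k^2)^{1/2}$ satisfies
the Cauchy criterion. This series is therefore convergent. Furthermore, it is
easy to see that

\[
\sum_{k=1}^{\infty} (|a_k| + |b_k|) \le \sum_{k=1}^{\infty} \left[2(a_k^2 + b_k^2)\right]^{1/2}.
\]

This indicates that the series $\sum_{k=1}^{\infty} (|a_k| + |b_k|)$ is convergent by the comparison
test.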

Note that, in general, we should not expect that a term-by-term differentiation
of a Fourier series of f(x) will result in a Fourier series of $f'(x)$. For
example, for the function f(x) = x, $-\pi < x < \pi$, the Fourier series is (see
Example 11.2.1)

\[
x \sim \sum_{n=1}^{\infty} \frac{2(-1)^{n+1}}{n}\sin nx.
\]

Differentiating this series term by term, we obtain $\sum_{n=1}^{\infty} 2(-1)^{n+1}\cos nx$.
This, however, is not the Fourier series of $f'(x) = 1$, since the Fourier series
of $f'(x) = 1$ is just 1. Note that in this case, $f(\pi) \ne f(-\pi)$, which violates
one of the conditions in Theorem 11.3.1.
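
The failure of term-by-term differentiation in this example can be seen directly: at x = 0 the terms $2(-1)^{n+1}\cos nx$ do not tend to zero, so the partial sums keep alternating instead of settling near $f'(0) = 1$. The following Python sketch, with a hypothetical function name, prints the first few partial sums.

```python
import math

def differentiated_partial_sum(x, n):
    """n-th partial sum of the term-by-term derivative of the series for x."""
    return sum(2 * (-1) ** (k + 1) * math.cos(k * x) for k in range(1, n + 1))

# At x = 0 the partial sums alternate between 2 and 0 and never approach 1.
print([differentiated_partial_sum(0.0, n) for n in range(1, 9)])
```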


Theorem 11.3.2. If f(x) is piecewise continuous on $[-\pi, \pi]$ and has the
Fourier series

\[
f(x) \sim \frac{a_0}{2} + \sum_{n=1}^{\infty} [a_n \cos nx + b_n \sin nx], \tag{11.32}
\]

then a term-by-term integration of this series gives the Fourier series of
$\int_{-\pi}^{x} f(t)\,dt$ for $x \in [-\pi, \pi]$, that is,

\[
\int_{-\pi}^{x} f(t)\,dt = \frac{a_0(\pi + x)}{2} + \sum_{n=1}^{\infty}\left[\frac{a_n}{n}\sin nx - \frac{b_n}{n}(\cos nx - \cos n\pi)\right],
\qquad -\pi \le x \le \pi.
\]

Furthermore, the integrated series converges uniformly to $\int_{-\pi}^{x} f(t)\,dt$.


Proof. Define the function g(x) as

\[
g(x) = \int_{-\pi}^{x} f(t)\,dt - \frac{a_0}{2}x. \tag{11.33}
\]

If f(x) is piecewise continuous on $[-\pi, \pi]$, then it is Riemann integrable
there, and by Theorem 6.4.7, g(x) is continuous on $[-\pi, \pi]$. Furthermore,
by Theorem 6.4.8, at each point where f(x) is continuous, g(x) is differentiable
and

\[
g'(x) = f(x) - \frac{a_0}{2}. \tag{11.34}
\]

This implies that $g'(x)$ is piecewise continuous on $[-\pi, \pi]$. In addition,
$g(-\pi) = g(\pi)$. To show this, we have from (11.33)

\[
g(-\pi) = \int_{-\pi}^{-\pi} f(t)\,dt + \frac{a_0}{2}\pi = \frac{a_0}{2}\pi,
\]

\[
g(\pi) = \int_{-\pi}^{\pi} f(t)\,dt - \frac{a_0}{2}\pi = a_0\pi - \frac{a_0}{2}\pi = \frac{a_0}{2}\pi,
\]

by the definition of $a_0$. Thus, the function g(x) satisfies the conditions of
Theorem 11.3.1. It follows that the Fourier series of g(x) converges uniformly
to g(x) on $[-\pi, \pi]$. We therefore have

\[
g(x) = \frac{A_0}{2} + \sum_{n=1}^{\infty} [A_n \cos nx + B_n \sin nx]. \tag{11.35}
\]

Moreover, by part (a) of Theorem 11.3.1, we have

\[
g'(x) = \sum_{n=1}^{\infty} [nB_n \cos nx - nA_n \sin nx]. \tag{11.36}
\]

Then, from (11.32), (11.34), and (11.36), we obtain

\[
a_n = nB_n, \qquad b_n = -nA_n, \qquad n = 1, 2, \ldots
\]

Substituting in (11.35), we get

\[
g(x) = \frac{A_0}{2} + \sum_{n=1}^{\infty}\left[-\frac{b_n}{n}\cos nx + \frac{a_n}{n}\sin nx\right].
\]

From (11.33) we then have

\[
\int_{-\pi}^{x} f(t)\,dt = \frac{a_0 x}{2} + \frac{A_0}{2} + \sum_{n=1}^{\infty}\left[-\frac{b_n}{n}\cos nx + \frac{a_n}{n}\sin nx\right]. \tag{11.37}
\]

To find the value of $A_0$, we set $x = -\pi$ in (11.37), which gives

\[
0 = -\frac{a_0\pi}{2} + \frac{A_0}{2} + \sum_{n=1}^{\infty}\left(-\frac{b_n}{n}\cos n\pi\right).
\]

Hence,

\[
\frac{A_0}{2} = \frac{a_0\pi}{2} + \sum_{n=1}^{\infty} \frac{b_n}{n}\cos n\pi.
\]

Substituting $A_0/2$ in (11.37), we finally obtain

\[
\int_{-\pi}^{x} f(t)\,dt = \frac{a_0(\pi + x)}{2} + \sum_{n=1}^{\infty}\left[\frac{a_n}{n}\sin nx - \frac{b_n}{n}(\cos nx - \cos n\pi)\right]. \qquad \square
\]


11.4. THE FOURIER INTEGRAL 


We have so far considered Fourier series corresponding to a function defined
on the interval $[-\pi, \pi]$. As was seen earlier in this chapter, if a function is
initially defined on $[-\pi, \pi]$, we can extend its definition outside $[-\pi, \pi]$ by
considering its periodic extension. For example, if $f(-\pi) = f(\pi)$, then we
can define f(x) everywhere in $(-\infty, \infty)$ by requiring that $f(x + 2\pi) = f(x)$ for
all x. The choice of the interval $[-\pi, \pi]$ was made mainly for convenience.
More generally, we can now consider a function f(x) defined on the interval
[-c, c]. For such a function, the corresponding Fourier series is given by

\[
\frac{a_0}{2} + \sum_{n=1}^{\infty}\left[a_n \cos\left(\frac{n\pi x}{c}\right) + b_n \sin\left(\frac{n\pi x}{c}\right)\right], \tag{11.38}
\]

where

\[
a_n = \frac{1}{c}\int_{-c}^{c} f(x)\cos\left(\frac{n\pi x}{c}\right) dx, \qquad n = 0, 1, 2, \ldots, \tag{11.39}
\]

\[
b_n = \frac{1}{c}\int_{-c}^{c} f(x)\sin\left(\frac{n\pi x}{c}\right) dx, \qquad n = 1, 2, \ldots \tag{11.40}
\]

Now, a question arises as to what to do when we have a function f(x) that
is already defined everywhere on $(-\infty, \infty)$, but is not periodic. We shall show
that, under certain conditions, such a function can be represented by an
infinite integral rather than by an infinite series. This integral is called a
Fourier integral. We now show the development of such an integral.

Substituting the expressions for $a_n$ and $b_n$ given by (11.39), (11.40) into
(11.38), we obtain the Fourier series

\[
\frac{1}{2c}\int_{-c}^{c} f(t)\,dt + \frac{1}{c}\sum_{n=1}^{\infty} \int_{-c}^{c} f(t)\cos\left[\frac{n\pi}{c}(t - x)\right] dt. \tag{11.41}
\]
If c is finite and f(x) satisfies the conditions of Corollary 11.2.3 on [-c, c],
then the Fourier series (11.41) converges to $\tfrac{1}{2}[f(x^-) + f(x^+)]$. However, this
series representation of f(x) is not valid outside the interval [-c, c] unless
f(x) is periodic with the period 2c.

In order to provide a representation that is valid for all values of x when
f(x) is not periodic, we need to consider extending the series in (11.41) by
letting c go to infinity, assuming that f(x) is absolutely integrable over the
whole real line. We now show how this can be done:

As $c \to \infty$, the first term in (11.41) goes to zero provided that $\int_{-\infty}^{\infty} |f(t)|\,dt$
exists. To investigate the limit of the series in (11.41) as $c \to \infty$, we set
$\lambda_1 = \pi/c$, $\lambda_2 = 2\pi/c$, ..., $\lambda_n = n\pi/c$, ..., $\Delta\lambda_n = \lambda_{n+1} - \lambda_n = \pi/c$,
n = 1, 2, .... We can then write

\[
\frac{1}{c}\sum_{n=1}^{\infty} \int_{-c}^{c} f(t)\cos\left[\frac{n\pi}{c}(t - x)\right] dt
= \frac{1}{\pi}\sum_{n=1}^{\infty} \Delta\lambda_n \int_{-c}^{c} f(t)\cos[\lambda_n(t - x)]\,dt. \tag{11.42}
\]

When c is large, $\Delta\lambda_n$ is small, and the right-hand side of (11.42) will be an
approximation of the integral

\[
\frac{1}{\pi}\int_{0}^{\infty}\left\{\int_{-\infty}^{\infty} f(t)\cos[\lambda(t - x)]\,dt\right\} d\lambda. \tag{11.43}
\]

This is the Fourier integral of f(x). Note that (11.43) can be written as

\[
\int_{0}^{\infty} [a(\lambda)\cos\lambda x + b(\lambda)\sin\lambda x]\,d\lambda, \tag{11.44}
\]

where

\[
a(\lambda) = \frac{1}{\pi}\int_{-\infty}^{\infty} f(t)\cos\lambda t\,dt,
\qquad
b(\lambda) = \frac{1}{\pi}\int_{-\infty}^{\infty} f(t)\sin\lambda t\,dt.
\]

The expression in (11.44) resembles a Fourier series where the sum has been
replaced by an integral and the parameter $\lambda$ is used in place of the integer n.
Moreover, $a(\lambda)$ and $b(\lambda)$ act like Fourier coefficients.
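
To make (11.44) concrete, consider the absolutely integrable function $f(x) = e^{-|x|}$, chosen here purely as an illustration. For it, $b(\lambda) = 0$ because f is even, and $a(\lambda) = 2/[\pi(1 + \lambda^2)]$ by a standard integral. The Python sketch below truncates the Fourier integral at a finite upper limit and applies a midpoint rule; the truncation point, step count, and function names are assumptions of this sketch.

```python
import math

def a_coef(lam):
    # Closed form of (1/pi) * integral of exp(-|t|) cos(lam t) dt over the real line.
    return 2.0 / (math.pi * (1.0 + lam * lam))

def fourier_integral(x, upper=200.0, steps=40000):
    """Midpoint-rule approximation of (11.44) truncated at lam = upper;
    b(lam) = 0 because exp(-|t|) is even."""
    h = upper / steps
    return h * sum(a_coef((i + 0.5) * h) * math.cos((i + 0.5) * h * x)
                   for i in range(steps))

for x in (0.0, 0.5, 1.5):
    # The two columns should be close; the truncation error is largest at x = 0.
    print(x, fourier_integral(x), math.exp(-abs(x)))
```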

We now show that the Fourier integral in (11.43) provides a representation
for f(x) provided that f(x) satisfies the conditions of the next theorem.


Theorem 11.4.1. Let f(x) be piecewise continuous on every finite interval
[a, b]. If $\int_{-\infty}^{\infty} |f(x)|\,dx$ exists, then at every point x $(-\infty < x < \infty)$ where
$f'(x^+)$ and $f'(x^-)$ exist, the Fourier integral of f(x) converges to $\tfrac{1}{2}[f(x^-) + f(x^+)]$,
that is,

\[
\frac{1}{\pi}\int_{0}^{\infty}\left\{\int_{-\infty}^{\infty} f(t)\cos[\lambda(t - x)]\,dt\right\} d\lambda = \tfrac{1}{2}\left[f(x^-) + f(x^+)\right].
\]

The proof of this theorem depends on the following lemmas:


Lemma 11.4.1. If f(x) is piecewise continuous on [a, b], then

\[
\lim_{n\to\infty} \int_{a}^{b} f(x)\sin nx\,dx = 0, \tag{11.45}
\]

\[
\lim_{n\to\infty} \int_{a}^{b} f(x)\cos nx\,dx = 0. \tag{11.46}
\]

Proof. Let the interval [a, b] be divided into a finite number of subintervals
on each of which f(x) is continuous. Let any one of these subintervals be
denoted by [p, q]. To prove formula (11.45) we only need to show that

\[
\lim_{n\to\infty} \int_{p}^{q} f(x)\sin nx\,dx = 0. \tag{11.47}
\]

For this purpose, we divide the interval [p, q] into k equal subintervals using
the partition points $x_0 = p, x_1, x_2, \ldots, x_k = q$. We can then write the integral
in (11.47) as

\[
\sum_{i=0}^{k-1} \int_{x_i}^{x_{i+1}} f(x)\sin nx\,dx,
\]

or equivalently as

\[
\sum_{i=0}^{k-1}\left\{ f(x_i)\int_{x_i}^{x_{i+1}} \sin nx\,dx + \int_{x_i}^{x_{i+1}} [f(x) - f(x_i)]\sin nx\,dx \right\}.
\]

It follows that

\[
\left|\int_{p}^{q} f(x)\sin nx\,dx\right| \le \sum_{i=0}^{k-1} |f(x_i)|\,\frac{|\cos nx_i - \cos nx_{i+1}|}{n}
+ \sum_{i=0}^{k-1} \int_{x_i}^{x_{i+1}} |f(x) - f(x_i)|\,dx.
\]

Let M denote the maximum value of |f(x)| on [p, q]. Then

\[
\left|\int_{p}^{q} f(x)\sin nx\,dx\right| \le \frac{2Mk}{n} + \sum_{i=0}^{k-1} \int_{x_i}^{x_{i+1}} |f(x) - f(x_i)|\,dx. \tag{11.48}
\]

Furthermore, since f(x) is continuous on [p, q], it is uniformly continuous
there [if necessary, f(x) can be made continuous at p, q by simply using
$f(p^+)$ and $f(q^-)$ as the values of f(x) at p, q, respectively]. Hence, for a
given $\epsilon > 0$, there exists a $\delta > 0$ such that

\[
|f(x_1) - f(x_2)| < \frac{\epsilon}{2(q - p)} \tag{11.49}
\]

if $|x_1 - x_2| < \delta$, where $x_1$ and $x_2$ are points in [p, q]. If k is chosen large
enough so that $|x_{i+1} - x_i| < \delta$, and hence $|x - x_i| < \delta$ if $x_i \le x \le x_{i+1}$, then
from (11.48) we obtain

\[
\left|\int_{p}^{q} f(x)\sin nx\,dx\right| \le \frac{2Mk}{n} + \frac{\epsilon}{2(q - p)}\sum_{i=0}^{k-1} \int_{x_i}^{x_{i+1}} dx,
\]

or

\[
\left|\int_{p}^{q} f(x)\sin nx\,dx\right| \le \frac{2Mk}{n} + \frac{\epsilon}{2},
\]

since

\[
\sum_{i=0}^{k-1} \int_{x_i}^{x_{i+1}} dx = \sum_{i=0}^{k-1} (x_{i+1} - x_i) = q - p.
\]

Choosing n large enough so that $2Mk/n < \epsilon/2$, we finally get

\[
\left|\int_{p}^{q} f(x)\sin nx\,dx\right| < \epsilon. \tag{11.50}
\]

Formula (11.45) follows from (11.50), since $\epsilon > 0$ is arbitrary. Formula (11.46)
can be proved in a similar fashion. □


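Lemma 11.4.2. If f(x) is piecewise continuous on [0, b] and $f'(0^+)$
exists, then

\[
\lim_{n\to\infty} \int_{0}^{b} f(x)\,\frac{\sin nx}{x}\,dx = \frac{\pi}{2} f(0^+).
\]

Proof. We have that

\[
\int_{0}^{b} f(x)\,\frac{\sin nx}{x}\,dx = f(0^+)\int_{0}^{b} \frac{\sin nx}{x}\,dx + \int_{0}^{b} \frac{f(x) - f(0^+)}{x}\,\sin nx\,dx. \tag{11.51}
\]

But

\[
\lim_{n\to\infty} \int_{0}^{b} \frac{\sin nx}{x}\,dx = \lim_{n\to\infty} \int_{0}^{bn} \frac{\sin x}{x}\,dx = \int_{0}^{\infty} \frac{\sin x}{x}\,dx = \frac{\pi}{2}
\]

(see Gillespie, 1959, page 89). Furthermore, the function $(1/x)[f(x) - f(0^+)]$
is piecewise continuous on [0, b], since f(x) is, and

\[
\lim_{x\to 0^+} \frac{f(x) - f(0^+)}{x} = f'(0^+),
\]

which exists. Hence, by Lemma 11.4.1,

\[
\lim_{n\to\infty} \int_{0}^{b} \frac{f(x) - f(0^+)}{x}\,\sin nx\,dx = 0.
\]

From (11.51) we then have

\[
\lim_{n\to\infty} \int_{0}^{b} f(x)\,\frac{\sin nx}{x}\,dx = \frac{\pi}{2} f(0^+). \qquad \square
\]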
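
Lemma 11.4.2 can be checked numerically for a smooth test function. The Python sketch below uses f(x) = cos x on [0, 1], for which $f(0^+) = 1$, so the integrals should approach $\pi/2$ as n grows (the error decays roughly like 1/n). The function name, the midpoint rule, and the panel count are assumptions of this illustration.

```python
import math

def sine_kernel_integral(f, b, n, panels=20000):
    """Midpoint-rule approximation of the integral of f(x) sin(nx)/x over [0, b].
    Midpoints avoid evaluating the integrand at x = 0."""
    h = b / panels
    total = 0.0
    for i in range(panels):
        x = (i + 0.5) * h
        total += f(x) * math.sin(n * x) / x
    return h * total

for n in (10, 100, 400):
    # Values should drift toward pi/2 = 1.5707... as n increases.
    print(n, sine_kernel_integral(math.cos, 1.0, n), math.pi / 2.0)
```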


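Lemma 11.4.3. If f(x) is piecewise continuous on [a, b], and $f'(x_0^-)$,
$f'(x_0^+)$ exist at $x_0$, $a < x_0 < b$, then

\[
\lim_{n\to\infty} \int_{a}^{b} f(x)\,\frac{\sin[n(x - x_0)]}{x - x_0}\,dx = \frac{\pi}{2}\left[f(x_0^-) + f(x_0^+)\right].
\]

Proof. We have that

\[
\begin{aligned}
\int_{a}^{b} f(x)\,\frac{\sin[n(x - x_0)]}{x - x_0}\,dx
&= \int_{a}^{x_0} f(x)\,\frac{\sin[n(x - x_0)]}{x - x_0}\,dx + \int_{x_0}^{b} f(x)\,\frac{\sin[n(x - x_0)]}{x - x_0}\,dx \\
&= \int_{0}^{x_0 - a} f(x_0 - x)\,\frac{\sin nx}{x}\,dx + \int_{0}^{b - x_0} f(x_0 + x)\,\frac{\sin nx}{x}\,dx.
\end{aligned}
\]

Lemma 11.4.2 applies to each of the above integrals, since the right-hand
derivatives of $f(x_0 - x)$ and $f(x_0 + x)$ at x = 0 are $-f'(x_0^-)$ and $f'(x_0^+)$,
respectively, and both derivatives exist. Furthermore,

\[
\lim_{x\to 0^+} f(x_0 - x) = f(x_0^-)
\]

and

\[
\lim_{x\to 0^+} f(x_0 + x) = f(x_0^+).
\]

It follows that

\[
\lim_{n\to\infty} \int_{a}^{b} f(x)\,\frac{\sin[n(x - x_0)]}{x - x_0}\,dx = \frac{\pi}{2}\left[f(x_0^-) + f(x_0^+)\right]. \qquad \square
\]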


Proof of Theorem 11.4.1. The function f(x) satisfies the conditions of 
Lemma 11.4.3 on the interval [a, b]. Hence, at any point x, a <x <b, where 
f'(x) and f'(xğ) exist, 
sin[ A(t —x)] 

t == 


tim f° F(0) dt= Ef) +f(x+)]. (11.52) 


Let us now partition the integral 


alla dia Hig 


TA FO 
= in mc pa Uae + fro ae x)| a 
ef f(t jee (11.53) 


From the first integral in (11.53) we have 


dt. 


IEG fa ahai x)] al< fi KOI 


-o |t—x| 
Since t <a and a <x, then |t—x| >x —a. Hence, 


i ON 


o |t =x | 


t< em | f(t) ldt. (11.54) 
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The integral on the right-hand side of (11.54) exists because /~,.|f(¢)| dt 
does. Similarly, from the third integral in (11.53) we have, if x <b, 


dt 


eo A |. Po 


|t—x| 
< sof ole 


1 œ 
Seri MOC 


Hence, the first and third integrals in (11.53) are convergent. It follows that
for any $\epsilon > 0$, there exists a positive number N such that if a < -N and
b > N, then these integrals will each be less than $\epsilon/3$ in absolute value.
Furthermore, by (11.52), the absolute value of the difference between the
second integral in (11.53) and the value $(\pi/2)[f(x^-) + f(x^+)]$ can be made
less than $\epsilon/3$, if $\lambda$ is chosen large enough. Consequently, the
absolute value of the difference between the value of the integral I and
$(\pi/2)[f(x^-) + f(x^+)]$ will be less than $\epsilon$, if $\lambda$ is chosen
large enough. Thus,
\[ \lim_{\lambda \to \infty} \int_{-\infty}^{\infty} f(t) \frac{\sin[\lambda(t - x)]}{t - x} \, dt = \frac{\pi}{2} [f(x^-) + f(x^+)]. \tag{11.55} \]

The expression $\sin[\lambda(t - x)]/(t - x)$ in (11.55) can be written as
\[ \frac{\sin[\lambda(t - x)]}{t - x} = \int_0^{\lambda} \cos[\alpha(t - x)] \, d\alpha. \]

Formula (11.55) can then be expressed as
\[
\begin{aligned}
\frac{1}{2} [f(x^-) + f(x^+)]
&= \frac{1}{\pi} \lim_{\lambda \to \infty} \int_{-\infty}^{\infty} f(t) \, dt \int_0^{\lambda} \cos[\alpha(t - x)] \, d\alpha \\
&= \frac{1}{\pi} \lim_{\lambda \to \infty} \int_0^{\lambda} d\alpha \int_{-\infty}^{\infty} f(t) \cos[\alpha(t - x)] \, dt.
\end{aligned} \tag{11.56}
\]

The change of the order of integration in (11.56) is valid because the
integrand in (11.56) does not exceed |f(t)| in absolute value, so that the
integral $\int_{-\infty}^{\infty} f(t) \cos[\alpha(t - x)] \, dt$ converges
uniformly for all $\alpha$ (see Carslaw, 1930, page 199; Pinkus and Zafrany,
1997, page 187). From (11.56) we finally obtain


\[ \frac{1}{2} [f(x^-) + f(x^+)] = \frac{1}{\pi} \int_0^{\infty} \left( \int_{-\infty}^{\infty} f(t) \cos[\alpha(t - x)] \, dt \right) d\alpha. \qquad \blacksquare \]


11.5. APPROXIMATION OF FUNCTIONS BY TRIGONOMETRIC
POLYNOMIALS


By a trigonometric polynomial of the nth order we mean an expression of
the form
\[ t_n(x) = \frac{\alpha_0}{2} + \sum_{k=1}^{n} [\alpha_k \cos kx + \beta_k \sin kx]. \tag{11.57} \]


A theorem of Weierstrass states that any continuous function of period $2\pi$
can be uniformly approximated by a trigonometric polynomial of some order
(see, for example, Tolstov, 1962, Chapter 5). Thus, for a given $\epsilon > 0$,
there exists a trigonometric polynomial of the form (11.57) such that
\[ |f(x) - t_n(x)| < \epsilon \]
for all values of x. In case the Fourier series for f(x) is uniformly convergent,
then $t_n(x)$ can be chosen to be equal to $s_n(x)$, the nth partial sum of the
Fourier series. However, it should be noted that $t_n(x)$ is not merely a partial
sum of the Fourier series for f(x), since a continuous function may have a
divergent Fourier series (see Jackson, 1941, page 26). We now show that $s_n(x)$
has a certain optimal property among all trigonometric polynomials of the
same order. To demonstrate this fact, let f(x) be Riemann integrable on
$[-\pi, \pi]$, and let $s_n(x)$ be the partial sum of order n of its Fourier
series, that is, $s_n(x) = a_0/2 + \sum_{k=1}^{n} [a_k \cos kx + b_k \sin kx]$.
Let $r_n(x) = f(x) - s_n(x)$. Then, from (11.2),


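\[
\int_{-\pi}^{\pi} f(x) \cos kx \, dx = \int_{-\pi}^{\pi} s_n(x) \cos kx \, dx = \pi a_k, \qquad k = 0, 1, \ldots, n.
\]
Hence,
\[ \int_{-\pi}^{\pi} r_n(x) \cos kx \, dx = 0 \quad \text{for } k \le n. \tag{11.58} \]
We can similarly show that
\[ \int_{-\pi}^{\pi} r_n(x) \sin kx \, dx = 0 \quad \text{for } k \le n. \tag{11.59} \]
Now, let $u_n(x) = t_n(x) - s_n(x)$, where $t_n(x)$ is given by (11.57). Then
\[
\begin{aligned}
\int_{-\pi}^{\pi} [f(x) - t_n(x)]^2 \, dx
&= \int_{-\pi}^{\pi} [r_n(x) - u_n(x)]^2 \, dx \\
&= \int_{-\pi}^{\pi} r_n^2(x) \, dx - 2 \int_{-\pi}^{\pi} r_n(x) u_n(x) \, dx + \int_{-\pi}^{\pi} u_n^2(x) \, dx \\
&= \int_{-\pi}^{\pi} [f(x) - s_n(x)]^2 \, dx + \int_{-\pi}^{\pi} u_n^2(x) \, dx,
\end{aligned} \tag{11.60}
\]
since, by (11.58) and (11.59),
\[ \int_{-\pi}^{\pi} r_n(x) u_n(x) \, dx = 0. \]
From (11.60) it follows that
\[ \int_{-\pi}^{\pi} [f(x) - t_n(x)]^2 \, dx \ge \int_{-\pi}^{\pi} [f(x) - s_n(x)]^2 \, dx. \tag{11.61} \]
This shows that for all trigonometric polynomials of order n,
$\int_{-\pi}^{\pi} [f(x) - t_n(x)]^2 \, dx$ is minimized when $t_n(x) = s_n(x)$.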


11.5.1. Parseval’s Theorem 


Suppose that we have the Fourier series (11.5) for the function f(x), which is
assumed to be continuous of period $2\pi$. Let $s_n(x)$ be the nth partial sum
of the series. We recall from the proof of Lemma 11.2.1 that
\[
\int_{-\pi}^{\pi} [f(x) - s_n(x)]^2 \, dx = \int_{-\pi}^{\pi} f^2(x) \, dx - \pi \left[ \frac{a_0^2}{2} + \sum_{k=1}^{n} (a_k^2 + b_k^2) \right]. \tag{11.62}
\]
We also recall that for a given $\epsilon > 0$, there exists a trigonometric
polynomial $t_n(x)$ of order n such that
\[ |f(x) - t_n(x)| < \epsilon. \]
Hence,
\[ \int_{-\pi}^{\pi} [f(x) - t_n(x)]^2 \, dx < 2\pi\epsilon^2. \]
Applying (11.61), we obtain
\[
\int_{-\pi}^{\pi} [f(x) - s_n(x)]^2 \, dx \le \int_{-\pi}^{\pi} [f(x) - t_n(x)]^2 \, dx < 2\pi\epsilon^2. \tag{11.63}
\]
Since $\epsilon > 0$ is arbitrary, we may conclude from (11.62) and (11.63) that
the limit of the right-hand side of (11.62) is zero as $n \to \infty$, that is,
\[ \frac{1}{\pi} \int_{-\pi}^{\pi} f^2(x) \, dx = \frac{a_0^2}{2} + \sum_{k=1}^{\infty} (a_k^2 + b_k^2). \]
This result is known as Parseval's theorem after Marc Antoine Parseval
(1755-1836).
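
Parseval's theorem can be checked numerically on a simple case. The sketch
below is only an illustration: it assumes the function f(x) = |x| on
$[-\pi, \pi]$, whose Fourier coefficients are $a_0 = \pi$,
$a_k = 2[(-1)^k - 1]/(\pi k^2)$, and $b_k = 0$ (standard closed forms, not
derived in this section). The function names are hypothetical.

```python
import math

def fourier_coefficients_abs_x(k):
    """Fourier coefficients (a_k, b_k) of f(x) = |x| on [-pi, pi] (assumed example)."""
    if k == 0:
        return math.pi, 0.0
    a_k = 2.0 * ((-1) ** k - 1) / (math.pi * k * k)
    return a_k, 0.0

def parseval_right_side(n_terms):
    """a_0^2 / 2 + sum_{k=1}^{n_terms} (a_k^2 + b_k^2)."""
    a0, _ = fourier_coefficients_abs_x(0)
    total = a0 * a0 / 2.0
    for k in range(1, n_terms + 1):
        a_k, b_k = fourier_coefficients_abs_x(k)
        total += a_k * a_k + b_k * b_k
    return total

# (1/pi) * integral of x^2 over [-pi, pi] equals 2 pi^2 / 3.
left_side = 2.0 * math.pi ** 2 / 3.0
right_side = parseval_right_side(200)
print(left_side, right_side)
```

Because the coefficients decay like $1/k^2$, the truncated sum approaches the
left-hand side quickly; a function with slower coefficient decay would need
many more terms.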


11.6. THE FOURIER TRANSFORM 


In the previous sections we discussed Fourier series for functions defined on 
a finite interval (or periodic functions defined on R, the set of all real 
numbers). In this section, we study a particular transformation of functions 
defined on R which are not periodic. 

Let f(x) be defined on $R = (-\infty, \infty)$. The Fourier transform of f(x) is
a function defined on R as
\[ F(w) = \frac{1}{2\pi} \int_{-\infty}^{\infty} f(x) e^{-iwx} \, dx, \tag{11.64} \]
where i is the complex number $\sqrt{-1}$, and
\[ e^{-iwx} = \cos wx - i \sin wx. \]

A proper understanding of such a transformation requires some knowledge 
of complex analysis, which is beyond the scope of this book. However, due to 
the importance and prevalence of the use of this transformation in various 
fields of science and engineering, some coverage of its properties is neces- 
sary. For this reason, we merely state some basic results and properties 
concerning this transformation. For more details, the reader is referred to 
standard books on Fourier series, for example, Pinkus and Zafrany (1997, 
Chapter 3), Kufner and Kadlec (1971, Chapter 8), and Weaver (1989, 
Chapter 6). 


Theorem 11.6.1. If f(x) is absolutely integrable on R, then its Fourier
transform F(w) exists.


Theorem 11.6.2. If f(x) is piecewise continuous and absolutely integrable
on R, then its Fourier transform F(w) has the following properties:

a. F(w) is a continuous function on R.

b. $\lim_{w \to \pm\infty} F(w) = 0$.


Note that f(x) is piecewise continuous on R if it is piecewise continuous 
on each finite interval [a, b]. 


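EXAMPLE 11.6.1. Let $f(x) = e^{-|x|}$. This function is absolutely integrable
on R, since
\[ \int_{-\infty}^{\infty} e^{-|x|} \, dx = 2 \int_0^{\infty} e^{-x} \, dx = 2. \]
Its Fourier transform is given by
\[
\begin{aligned}
F(w) &= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-|x|} e^{-iwx} \, dx \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-|x|} (\cos wx - i \sin wx) \, dx \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-|x|} \cos wx \, dx \\
&= \frac{1}{\pi} \int_0^{\infty} e^{-x} \cos wx \, dx.
\end{aligned}
\]
Integrating by parts twice, it can be shown that
\[ F(w) = \frac{1}{\pi(1 + w^2)}. \]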
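
The closed form in Example 11.6.1 can be compared with a direct numerical
evaluation of $(1/\pi) \int_0^{\infty} e^{-x} \cos wx \, dx$. The following is a
minimal sketch, assuming a truncation of the infinite range at a hypothetical
upper limit of 40 and a simple composite trapezoidal sum; the function names
are illustrative only.

```python
import math

def fourier_transform_exp_abs(w, upper=40.0, steps=40000):
    """Approximate (1/pi) * integral_0^upper exp(-x) cos(w x) dx by trapezoids."""
    h = upper / steps
    # End-point contributions carry weight 1/2 in the trapezoidal sum.
    total = 0.5 * (1.0 + math.exp(-upper) * math.cos(w * upper))
    for j in range(1, steps):
        x = j * h
        total += math.exp(-x) * math.cos(w * x)
    return total * h / math.pi

def closed_form(w):
    """F(w) = 1 / (pi (1 + w^2)) from Example 11.6.1."""
    return 1.0 / (math.pi * (1.0 + w * w))

for w in (0.0, 1.0, 2.5):
    print(w, fourier_transform_exp_abs(w), closed_form(w))
```

The truncation is harmless here because the integrand is bounded by $e^{-x}$;
for a more slowly decaying f(x), the upper limit would have to be chosen with
more care.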


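EXAMPLE 11.6.2. Consider the function
\[
f(x) = \begin{cases} 1, & |x| \le a, \\ 0, & \text{otherwise}, \end{cases}
\]
where a is a finite positive number. This function is absolutely integrable on
R, since
\[ \int_{-\infty}^{\infty} |f(x)| \, dx = \int_{-a}^{a} dx = 2a. \]
Its Fourier transform is given by
\[
\begin{aligned}
F(w) &= \frac{1}{2\pi} \int_{-a}^{a} e^{-iwx} \, dx \\
&= \frac{1}{2\pi i w} (e^{iwa} - e^{-iwa}) \\
&= \frac{\sin wa}{\pi w}.
\end{aligned}
\]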


The next theorem gives the condition that makes it possible to express the 
function f(x) in terms of its Fourier transform using the so-called inverse 
Fourier transform. 


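Theorem 11.6.3. Let f(x) be piecewise continuous and absolutely integrable
on R. Then for every point $x \in R$ where $f'(x^-)$ and $f'(x^+)$ exist, we
have
\[ \frac{1}{2} [f(x^-) + f(x^+)] = \int_{-\infty}^{\infty} F(w) e^{iwx} \, dw. \]
In particular, if f(x) is continuous on R, then
\[ f(x) = \int_{-\infty}^{\infty} F(w) e^{iwx} \, dw. \tag{11.65} \]

By applying Theorem 11.6.3 to the function in Example 11.6.1, we obtain
\[
\begin{aligned}
e^{-|x|} &= \int_{-\infty}^{\infty} \frac{e^{iwx}}{\pi(1 + w^2)} \, dw \\
&= \int_{-\infty}^{\infty} \frac{1}{\pi(1 + w^2)} (\cos wx + i \sin wx) \, dw \\
&= \int_{-\infty}^{\infty} \frac{\cos wx}{\pi(1 + w^2)} \, dw \\
&= \frac{2}{\pi} \int_0^{\infty} \frac{\cos wx}{1 + w^2} \, dw.
\end{aligned}
\]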


11.6.1. Fourier Transform of a Convolution 


Let f(x) and g(x) be absolutely integrable functions on R. By definition, the
function
\[ h(x) = \int_{-\infty}^{\infty} f(x - y) g(y) \, dy \tag{11.66} \]
is called the convolution of f(x) and g(x) and is denoted by $(f * g)(x)$.

Theorem 11.6.4. Let f(x) and g(x) be absolutely integrable on R. Let
F(w) and G(w) be their respective Fourier transforms. Then, the Fourier
transform of the convolution $(f * g)(x)$ is given by $2\pi F(w) G(w)$.


11.7. APPLICATIONS IN STATISTICS 


Fourier series have been used in a wide variety of areas in statistics, such as 
time series, stochastic processes, approximation of probability distribution 
functions, and the modeling of a periodic response variable, to name just a 
few. In addition, the methods and results of Fourier analysis have been 
effectively utilized in the analytic theory of probability (see, for example, 
Kawata, 1972). 


11.7.1. Applications in Time Series 


A time series is a collection of observations made sequentially in time. 
Examples of time series can be found in a variety of fields ranging from 
economics to engineering. Many types of time series occur in the physical 
sciences, particularly in meteorology, such as the study of rainfall on succes- 
sive days, as well as in marine science and geophysics. 

The stimulus for the use of Fourier methods in time series analysis is the 
recognition that when observing data over time, some aspects of an observed 
physical phenomenon tend to exhibit cycles or periodicities. Therefore, when 
considering a model to represent such data, it is natural to use models that 
contain sines and cosines, that is, trigonometric models, to describe the 
behavior. Let $y_1, y_2, \ldots, y_n$ denote a time series consisting of n
observations obtained over time. These observations can be represented by the
trigonometric polynomial model
\[ y_t = \frac{a_0}{2} + \sum_{k=1}^{m} [a_k \cos \omega_k t + b_k \sin \omega_k t], \qquad t = 1, 2, \ldots, n, \]
where
\[ \omega_k = \frac{2\pi k}{n}, \qquad k = 0, 1, 2, \ldots, m, \]
\[ a_k = \frac{2}{n} \sum_{t=1}^{n} y_t \cos \omega_k t, \qquad k = 0, 1, \ldots, m, \]
\[ b_k = \frac{2}{n} \sum_{t=1}^{n} y_t \sin \omega_k t, \qquad k = 1, 2, \ldots, m. \]
The values $\omega_1, \omega_2, \ldots, \omega_m$ are called harmonic
frequencies. This model provides a decomposition of the time series into a set
of cycles based on the harmonic frequencies. Here, n is assumed to be odd and
equal to 2m + 1, so that the harmonic frequencies lie in the range 0 to $\pi$.
The expressions for $a_k$ (k = 0, 1, ..., m) and $b_k$ (k = 1, 2, ..., m) were
obtained by treating the model as a linear regression model with 2m + 1
parameters and then fitting it to the 2m + 1 observations by the method of
least squares. See, for example, Fuller (1976, Chapter 7).

The quantity
\[ I(\omega_k) = \frac{n}{2} (a_k^2 + b_k^2), \qquad k = 1, 2, \ldots, m, \tag{11.67} \]
represents the sum of squares associated with the frequency $\omega_k$. For
k = 1, 2, ..., m, the quantities in (11.67) define the so-called periodogram.

If $y_1, y_2, \ldots, y_n$ are independently distributed as normal variates
with zero means and variances $\sigma^2$, then the $a_k$'s and $b_k$'s, being
linear combinations of the $y_t$'s, will be normally distributed. They are also
independent, since the sine and cosine functions are orthogonal. It follows
that $[n/(2\sigma^2)](a_k^2 + b_k^2)$, for k = 1, 2, ..., m, are distributed as
independent chi-squared variates with two degrees of freedom each. The
periodogram can be used to search for cycles or periodicities in the data.
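
The harmonic decomposition and the periodogram can be sketched in a few lines
of code. The example below is an illustration under stated assumptions: the
data vector `y` is hypothetical (n = 11 made-up observations), and the function
names are not taken from any library. With n odd, the model has as many
parameters as observations, so the fitted values should reproduce the data,
and the periodogram ordinates should add up to the corrected total sum of
squares.

```python
import math

def harmonic_fit(y):
    """Return (omega, a, b) for the trigonometric model with n = 2m + 1 observations."""
    n = len(y)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of observations")
    m = (n - 1) // 2
    omega = [2.0 * math.pi * k / n for k in range(m + 1)]
    a = [2.0 / n * sum(y[t - 1] * math.cos(omega[k] * t) for t in range(1, n + 1))
         for k in range(m + 1)]
    # b[0] is a placeholder so that a and b share the same indexing.
    b = [0.0] + [2.0 / n * sum(y[t - 1] * math.sin(omega[k] * t) for t in range(1, n + 1))
                 for k in range(1, m + 1)]
    return omega, a, b

def fitted_value(t, omega, a, b):
    """a_0 / 2 + sum_k [a_k cos(omega_k t) + b_k sin(omega_k t)]."""
    return a[0] / 2.0 + sum(a[k] * math.cos(omega[k] * t) + b[k] * math.sin(omega[k] * t)
                            for k in range(1, len(a)))

def periodogram(y):
    """List of (omega_k, (n/2)(a_k^2 + b_k^2)) for k = 1, ..., m, as in (11.67)."""
    n = len(y)
    omega, a, b = harmonic_fit(y)
    return [(omega[k], n / 2.0 * (a[k] ** 2 + b[k] ** 2)) for k in range(1, len(a))]

# Hypothetical series with n = 11 observations (illustration only).
y = [3.1, 4.7, 5.2, 3.9, 2.4, 1.8, 2.9, 4.4, 5.6, 4.1, 2.2]
for freq, ordinate in periodogram(y):
    print(round(freq, 4), round(ordinate, 4))
```

A large ordinate at some $\omega_k$ suggests a cycle of period $2\pi/\omega_k$
time units in the data.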

Much of time series data analysis is based on the Fourier transform and its 
efficient computation. For more details concerning Fourier analysis of time 
series, the reader is referred to Bloomfield (1976) and Otnes and Enochson 
(1978). 


11.7.2. Representation of Probability Distributions 


One of the interesting applications of Fourier series in statistics is in 
providing a representation that can be used to evaluate the distribution 
function of a random variable with a finite range. Woods and Posten (1977) 
introduced two such representations by combining the concepts of Fourier 
series and Chebyshev polynomials of the first kind (see Section 10.4.1). These 
representations are given by the following two theorems: 


Theorem 11.7.1. Let X be a random variable with a cumulative distribution
function F(x) defined on [0, 1]. Then, F(x) can be represented as a
Fourier series of the form
\[
F(x) = \begin{cases}
0, & x < 0, \\
1 - \theta/\pi - \sum_{n=1}^{\infty} b_n \sin n\theta, & 0 \le x < 1, \\
1, & x \ge 1,
\end{cases}
\]
where $\theta = \text{Arccos}(2x - 1)$, $b_n = [2/(n\pi)] E[T_n^*(X)]$, and
$E[T_n^*(X)]$ is the expected value of the random variable
\[ T_n^*(X) = \cos[n \, \text{Arccos}(2X - 1)], \qquad 0 \le X \le 1. \tag{11.68} \]

Note that $T_n^*(x)$ is basically a Chebyshev polynomial of the first kind and
of the nth degree defined on [0, 1].


Proof. See Theorem 1 in Woods and Posten (1977). $\blacksquare$


The second representation theorem is similar to Theorem 11.7.1, except
that X is now assumed to be a random variable over [-1, 1].


Theorem 11.7.2. Let X be a random variable with a cumulative distribution
function F(x) defined on [-1, 1]. Then
\[
F(x) = \begin{cases}
0, & x < -1, \\
1 - \theta/\pi - \sum_{n=1}^{\infty} b_n \sin n\theta, & -1 \le x < 1, \\
1, & x \ge 1,
\end{cases}
\]
where $\theta = \text{Arccos} \, x$, $b_n = [2/(n\pi)] E[T_n(X)]$, $E[T_n(X)]$
is the expected value of the random variable
\[ T_n(X) = \cos[n \, \text{Arccos} \, X], \qquad -1 \le X \le 1, \]
and $T_n(x)$ is the Chebyshev polynomial of the first kind and the nth degree
[see formula (10.12)].


Proof. See Theorem 2 in Woods and Posten (1977). $\blacksquare$


To evaluate the Fourier series representation of F(x), we must first
compute the coefficients $b_n$. For example, in Theorem 11.7.2,
$b_n = [2/(n\pi)] E[T_n(X)]$. Since the Chebyshev polynomial $T_n(x)$ can be
written in the form
\[ T_n(x) = \sum_{k=0}^{n} a_{nk} x^k, \qquad n = 1, 2, \ldots, \]
the computation of $b_n$ is equivalent to evaluating
\[ b_n = \frac{2}{n\pi} \sum_{k=0}^{n} a_{nk} \mu_k', \qquad n = 1, 2, \ldots, \]
where $\mu_k' = E(X^k)$ is the kth noncentral moment of X. The coefficients
$a_{nk}$ can be obtained by using the recurrence relation (10.16), that is,
\[ T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x), \qquad n = 1, 2, \ldots, \]
with $T_0(x) = 1$, $T_1(x) = x$. This allows us to evaluate the $a_{nk}$'s
recursively. The series
\[ F(x) = 1 - \frac{\theta}{\pi} - \sum_{n=1}^{\infty} b_n \sin n\theta \]
is then truncated at n = N. Thus
\[ F(x) \approx 1 - \frac{\theta}{\pi} - \sum_{k=1}^{N} b_k \sin k\theta. \]


Several values of N can be tried to determine the sensitivity of the approxi- 
mation. We note that this series expansion provides an approximation of 
F(x) in terms of the noncentral moments of X. Good estimates of these 
moments should therefore be available. 
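
The computation just described (Chebyshev coefficients by recurrence, then
$b_n$ from the noncentral moments, then the truncated series) can be sketched
as follows. This is an illustration under assumptions, not the procedure of
Woods and Posten: X is taken to be uniform on [-1, 1], so that
$E(X^k) = 1/(k + 1)$ for even k and 0 for odd k, and F(x) = (x + 1)/2 is
available for comparison. Exact rational arithmetic is used for the moments
because the coefficients $a_{nk}$ grow quickly and alternate in sign. All
function names are hypothetical.

```python
from fractions import Fraction
import math

def chebyshev_coefficients(N):
    """Return c with c[n][k] = coefficient of x**k in T_n(x), for n = 0, ..., N."""
    coeffs = [[1], [0, 1]]
    for n in range(1, N):
        prev, cur = coeffs[n - 1], coeffs[n]
        nxt = [0] * (n + 2)
        for k, a in enumerate(cur):
            nxt[k + 1] += 2 * a      # contribution of 2x T_n(x)
        for k, a in enumerate(prev):
            nxt[k] -= a              # contribution of -T_{n-1}(x)
        coeffs.append(nxt)
    return coeffs[: N + 1]

def cdf_series(x, moments, N):
    """Truncated series 1 - theta/pi - sum_{n=1}^N b_n sin(n theta) of Theorem 11.7.2."""
    coeffs = chebyshev_coefficients(N)
    theta = math.acos(x)
    total = 1.0 - theta / math.pi
    for n in range(1, N + 1):
        # E[T_n(X)] = sum_k a_{nk} E(X^k), computed exactly.
        e_tn = sum(a * moments[k] for k, a in enumerate(coeffs[n]))
        b_n = 2.0 * float(e_tn) / (n * math.pi)
        total -= b_n * math.sin(n * theta)
    return total

N = 40
# Assumed example: noncentral moments of the uniform distribution on [-1, 1].
uniform_moments = [Fraction(1, k + 1) if k % 2 == 0 else Fraction(0) for k in range(N + 1)]
approx = cdf_series(0.5, uniform_moments, N)
print(approx)   # to be compared with F(0.5) = 0.75
```

Repeating the computation for several values of N gives a rough idea of the
sensitivity of the approximation to the truncation point.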

It is also possible to extend the applications of Theorems 11.7.1 and 11.7.2 
to a random variable X with an infinite range provided that there exists a 
transformation which transforms X to a random variable Y over [0,1], or 
over [—1, 1], such that the moments of Y are known from the moments of X. 

In another application, Fettis (1976) developed a Fourier series expansion 
for Pearson Type IV distributions. These are density functions, f(x), that 
satisfy the differential equation 


\[ \frac{df(x)}{dx} = \frac{-(x + a)}{c_0 + c_1 x + c_2 x^2} f(x), \]
where a, $c_0$, $c_1$, and $c_2$ are constants determined from the central
moments $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$, which can be estimated from the
raw data. The data are standardized so that $\mu_1 = 0$, $\mu_2 = 1$. This
results in the following expressions for a, $c_0$, $c_1$, and $c_2$:
\[ c_0 = \frac{2\alpha - 1}{2(\alpha + 1)}, \]
\[ c_1 = a = \frac{\mu_3(\alpha - 1)}{2(\alpha + 1)}, \]
\[ c_2 = \frac{1}{2(\alpha + 1)}, \]
where
\[ \alpha = \frac{3(\mu_4 - \mu_3^2 - 1)}{2\mu_4 - 3\mu_3^2 - 6}. \]

Fettis (1976) provided additional details that explain how to approximate the
cumulative distribution function,
\[ F(x) = \int_{-\infty}^{x} f(t) \, dt, \]
using Fourier series.


11.7.3. Regression Modeling 


In regression analysis and response surface methodology, it is quite common
to use polynomial models to approximate the mean $\eta$ of a response variable.
There are, however, situations in which polynomial models are not adequate
representatives of the mean response, as when $\eta$ is known to be a periodic
function. In this case, it is more appropriate to use an approximating function
which is itself periodic.

Kupper (1972) proposed using partial sums of Fourier series as possible
models to approximate the mean response. Consider the following trigonometric
polynomial of order d,
\[ \eta = \alpha_0 + \sum_{n=1}^{d} [\alpha_n \cos n\phi + \beta_n \sin n\phi], \tag{11.69} \]
where $0 \le \phi \le 2\pi$ represents either a variable taking values on the
real line between 0 and $2\pi$, or the angle associated with the polar
coordinates of a point on the unit circle. Let $u = (u_1, u_2)'$, where
$u_1 = \cos\phi$, $u_2 = \sin\phi$. Then, when d = 2, the model in (11.69) can
be written as
\[ \eta = \alpha_0 + \alpha_1 u_1 + \beta_1 u_2 + \alpha_2 u_1^2 - \alpha_2 u_2^2 + 2\beta_2 u_1 u_2, \tag{11.70} \]
since $\sin 2\phi = 2 \sin\phi \cos\phi = 2u_1 u_2$, and
$\cos 2\phi = \cos^2\phi - \sin^2\phi = u_1^2 - u_2^2$.

One of the objectives of response surface methodology is the determination
of optimum settings of the model's control variables that result in a
maximum (or minimum) predicted response. The predicted response $\hat{y}$ at a
point provides an estimate of $\eta$ in (11.69) and is obtained by replacing
$\alpha_0$, $\alpha_n$, $\beta_n$ in (11.69) by their least-squares estimates
$\hat{\alpha}_0$, $\hat{\alpha}_n$, and $\hat{\beta}_n$, respectively,
n = 1, 2, ..., d. For example, if d = 2, we have


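\[ \hat{y} = \hat{\alpha}_0 + \sum_{n=1}^{2} [\hat{\alpha}_n \cos n\phi + \hat{\beta}_n \sin n\phi], \tag{11.71} \]
which can be expressed using (11.70) as
\[ \hat{y} = \hat{\alpha}_0 + u'b + u'Bu, \]
where $b = (\hat{\alpha}_1, \hat{\beta}_1)'$ and
\[ B = \begin{bmatrix} \hat{\alpha}_2 & \hat{\beta}_2 \\ \hat{\beta}_2 & -\hat{\alpha}_2 \end{bmatrix}, \]
with $u'u = 1$. The method of Lagrange multipliers can then be used to
determine the stationary points of $\hat{y}$ subject to the constraint
$u'u = 1$. Details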
of this procedure are given in Kupper (1972, Section 3). In a follow-up paper, 
Kupper (1973) presented some results on the construction of optimal designs 
for model (11.69). 

More recently, Anderson-Cook (2000) used model (11.71) in experimental
situations involving cylindrical data. For such data, it is of interest to model
the relationship between two correlated components, one a standard linear
measurement y, and the other an angular measure $\phi$. Such data arise, for
example, in biology (plant or animal migration patterns) and geology
(direction and magnitude of magnetic fields). The fitting of model
(11.71) is done by using the method of ordinary least squares with the
assumption that y is normally distributed and has a constant variance.
Anderson-Cook used an example, originally presented in Mardia and Sutton
(1978), of a cylindrical data set in which y is temperature (measured in
degrees Fahrenheit) and $\phi$ is wind direction (measured in radians). Based on
this example, the fitted model is
\[ \hat{y} = 41.33 - 2.43 \cos\phi - 2.60 \sin\phi + 3.05 \cos 2\phi + 2.98 \sin 2\phi. \]
The corresponding standard errors of $\hat{\alpha}_0$, $\hat{\alpha}_1$,
$\hat{\beta}_1$, $\hat{\alpha}_2$, $\hat{\beta}_2$ are 1.1896, 1.6608,
1.7057, 1.4029, 1.7172, respectively. Both $\alpha_0$ and $\alpha_2$ are
significant parameters at the 5% level, and $\beta_2$ is significant at the
10% level.
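
As an illustration, the fitted model quoted above can be evaluated both in its
trigonometric form (11.71) and in the quadratic form (11.70) in
$u_1 = \cos\phi$, $u_2 = \sin\phi$. The sketch below also locates the largest
and smallest predicted responses by a plain grid search over $\phi$; this is a
stand-in chosen for simplicity, not the Lagrange-multiplier procedure of
Kupper (1972). The function names and the grid size are assumptions.

```python
import math

# Fitted coefficients quoted in the text (Anderson-Cook, 2000).
A0, A1, B1, A2, B2 = 41.33, -2.43, -2.60, 3.05, 2.98

def predict_trig(phi):
    """Predicted response in the trigonometric form (11.71)."""
    return (A0 + A1 * math.cos(phi) + B1 * math.sin(phi)
            + A2 * math.cos(2 * phi) + B2 * math.sin(2 * phi))

def predict_quadratic(phi):
    """Predicted response in the quadratic form (11.70), with u1 = cos(phi), u2 = sin(phi)."""
    u1, u2 = math.cos(phi), math.sin(phi)
    return A0 + A1 * u1 + B1 * u2 + A2 * u1 ** 2 - A2 * u2 ** 2 + 2 * B2 * u1 * u2

def grid_extrema(n_grid=3600):
    """Grid search on [0, 2 pi) for the largest and smallest predicted response."""
    grid = [2 * math.pi * j / n_grid for j in range(n_grid)]
    values = [predict_trig(phi) for phi in grid]
    i_max = max(range(n_grid), key=values.__getitem__)
    i_min = min(range(n_grid), key=values.__getitem__)
    return (grid[i_max], values[i_max]), (grid[i_min], values[i_min])

(phi_max, y_max), (phi_min, y_min) = grid_extrema()
print("maximum near phi =", round(phi_max, 3), "predicted value", round(y_max, 2))
print("minimum near phi =", round(phi_min, 3), "predicted value", round(y_min, 2))
```

The grid search only brackets the stationary points to within the grid
spacing; the constrained optimization on $u'u = 1$ gives them exactly.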


11.7.4. The Characteristic Function 


We have seen that the moment generating function $\phi(t)$ for a random
variable X is used to obtain the moments of X (see Section 5.6.2 and
Example 6.9.8). It may be recalled, however, that $\phi(t)$ may not be defined
for all values of t. To generate all the moments of X, it is sufficient for
$\phi(t)$ to be defined in a neighborhood of t = 0 (see Section 5.6.2). Some
well-known distributions do not have moment generating functions, such as the
Cauchy distribution (see Example 6.9.1).

Another function that generates the moments of a random variable in a
manner similar to $\phi(t)$, but is defined for all values of t and for all
random variables, is the characteristic function. By definition, the
characteristic function of a random variable X, denoted by $\phi_c(t)$, is
\[ \phi_c(t) = E[e^{itX}] = \int_{-\infty}^{\infty} e^{itx} \, dF(x), \tag{11.72} \]


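where F(x) is the cumulative distribution function of X, and i is the complex
number $\sqrt{-1}$. If X is discrete and has the values
$c_1, c_2, \ldots, c_n, \ldots$, then (11.72) takes the form
\[ \phi_c(t) = \sum_{j=1}^{\infty} p(c_j) e^{itc_j}, \tag{11.73} \]
where $p(c_j) = P[X = c_j]$, j = 1, 2, .... If X is continuous with the density
function f(x), then
\[ \phi_c(t) = \int_{-\infty}^{\infty} e^{itx} f(x) \, dx. \tag{11.74} \]
The function $\phi_c(t)$ is complex-valued in general, but is defined for all
values of t, since $e^{itx} = \cos tx + i \sin tx$, and both
$\int_{-\infty}^{\infty} \cos tx \, dF(x)$ and
$\int_{-\infty}^{\infty} \sin tx \, dF(x)$ exist by the fact that
\[ \int_{-\infty}^{\infty} |\cos tx| \, dF(x) \le \int_{-\infty}^{\infty} dF(x) = 1, \]
\[ \int_{-\infty}^{\infty} |\sin tx| \, dF(x) \le \int_{-\infty}^{\infty} dF(x) = 1. \]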


The characteristic function and the moment generating function, when the
latter exists, are related according to the formula
\[ \phi_c(t) = \phi(it). \]
Furthermore, it can be shown that if X has finite moments, then they can be
obtained by repeatedly differentiating $\phi_c(t)$ and evaluating the
derivatives at zero, that is,
\[ E(X^n) = \frac{1}{i^n} \left. \frac{d^n \phi_c(t)}{dt^n} \right|_{t=0}, \qquad n = 1, 2, \ldots. \]


Although $\phi_c(t)$ generates moments, it is mainly used as a tool to derive
distributions. For example, from (11.74) we note that when X is continuous,
the characteristic function is a Fourier-type transformation of the density
function f(x). This follows from (11.64), the Fourier transform of f(x), which
is given by $(1/2\pi) \int_{-\infty}^{\infty} f(x) e^{-itx} \, dx$. If we denote
this transform by G(t), then the relationship between $\phi_c(t)$ and G(t) is
given by
\[ \phi_c(t) = 2\pi G(-t). \]
By Theorem 11.6.3, if f(x) is continuous and absolutely integrable on R,
then f(x) can be derived from $\phi_c(t)$ by using formula (11.65), which can be
written as
\[
\begin{aligned}
f(x) &= \int_{-\infty}^{\infty} G(t) e^{itx} \, dt \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_c(-t) e^{itx} \, dt \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \phi_c(t) e^{-itx} \, dt.
\end{aligned} \tag{11.75}
\]


This is known as the inversion formula for characteristic functions. Thus 
the distribution of X can be uniquely determined by its characteristic 
function. There is therefore a one-to-one correspondence between distribu- 
tion functions and their corresponding characteristic functions. This provides 
a useful tool for deriving distributions of random variables that cannot be 
easily calculated, but whose characteristic functions are straightforward. 
Waller, Turnbull, and Hardin (1995) reviewed and discussed several algo- 
rithms for inverting characteristic functions, and gave several examples from 
various areas in statistics. Waller (1995) demonstrated that characteristic 
functions provide information beyond what is given by moment generating 
functions. He pointed out that moment generating functions may be of more 
mathematical than numerical use in characterizing distributions. He used an 
example to illustrate that numerical techniques using characteristic functions 
can differentiate between two distributions, even though their moment gen- 
erating functions are very similar (see also McCullagh, 1994). 

Luceño (1997) provided further and more general arguments to show that 
characteristic functions are superior to moment generating and probability 
generating functions (see Section 5.6.2) in their numerical behavior. 

One of the principal uses of characteristic functions is in deriving limiting 
distributions. This is based on the following theorem (see, for example, 
Pfeiffer, 1990, page 426): 


Theorem 11.7.3. Consider the sequence $\{F_n(x)\}_{n=1}^{\infty}$ of cumulative
distribution functions. Let $\{\phi_{nc}(t)\}_{n=1}^{\infty}$ be the
corresponding sequence of characteristic functions.

a. If $F_n(x)$ converges to a distribution function F(x) at every point of
continuity for F(x), then $\phi_{nc}(t)$ converges to $\phi_c(t)$ for all t,
where $\phi_c(t)$ is the characteristic function for F(x).

b. If $\phi_{nc}(t)$ converges to $\phi_c(t)$ for all t and $\phi_c(t)$ is
continuous at t = 0, then $\phi_c(t)$ is the characteristic function for a
distribution function F(x) such that $F_n(x)$ converges to F(x) at each point
of continuity of F(x).


It should be noted that in Theorem 11.7.3 the condition that the limiting
function $\phi_c(t)$ is continuous at t = 0 is essential for the validity of the
theorem. The following example shows that if this condition is violated, then
the theorem is no longer true:

Consider the cumulative distribution function
\[
F_n(x) = \begin{cases}
0, & x \le -n, \\
\dfrac{x + n}{2n}, & -n < x < n, \\
1, & x \ge n.
\end{cases}
\]
The corresponding characteristic function is
\[ \phi_{nc}(t) = \frac{1}{2n} \int_{-n}^{n} e^{itx} \, dx = \frac{\sin nt}{nt}. \]
As $n \to \infty$, $\phi_{nc}(t)$ converges for every t to $\phi_c(t)$ defined by
\[
\phi_c(t) = \begin{cases} 1, & t = 0, \\ 0, & t \ne 0. \end{cases}
\]
Thus, $\phi_c(t)$ is not continuous for t = 0. We note, however, that
$F_n(x) \to \frac{1}{2}$ for every fixed x. Hence, the limit of $F_n(x)$ is not
a cumulative distribution function.


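EXAMPLE 11.7.1. Consider the distribution defined by the density function
$f(x) = e^{-x}$ for x > 0. Its characteristic function is given by
\[
\begin{aligned}
\phi_c(t) &= \int_0^{\infty} e^{itx} e^{-x} \, dx \\
&= \int_0^{\infty} e^{-(1 - it)x} \, dx \\
&= \frac{1}{1 - it}.
\end{aligned}
\]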


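EXAMPLE 11.7.2. Consider the Cauchy density function
\[ f(x) = \frac{1}{\pi(1 + x^2)}, \qquad -\infty < x < \infty, \]
given in Example 6.9.1. The characteristic function is
\[
\begin{aligned}
\phi_c(t) &= \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{e^{itx}}{1 + x^2} \, dx \\
&= \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{\cos tx}{1 + x^2} \, dx + \frac{i}{\pi} \int_{-\infty}^{\infty} \frac{\sin tx}{1 + x^2} \, dx \\
&= \frac{2}{\pi} \int_0^{\infty} \frac{\cos tx}{1 + x^2} \, dx \\
&= e^{-|t|}.
\end{aligned}
\]
Note that this function is not differentiable at t = 0. We may recall from
Example 6.9.1 that none of the moments of the Cauchy distribution exist.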


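EXAMPLE 11.7.3. In Example 6.9.8 we saw that the moment generating
function for the gamma distribution $G(\alpha, \beta)$ with the density function
\[ f(x) = \frac{x^{\alpha - 1} e^{-x/\beta}}{\Gamma(\alpha) \beta^{\alpha}}, \qquad \alpha > 0, \ \beta > 0, \ 0 < x < \infty, \]
is $\phi(t) = (1 - \beta t)^{-\alpha}$. Hence, its characteristic function is
$\phi_c(t) = (1 - i\beta t)^{-\alpha}$.
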
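EXAMPLE 11.7.4. The characteristic function of the standard normal
distribution with the density function
\[ f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}, \qquad -\infty < x < \infty, \]
is
\[
\begin{aligned}
\phi_c(t) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2} e^{itx} \, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x^2 - 2itx)} \, dx \\
&= \frac{e^{-t^2/2}}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(x - it)^2} \, dx \\
&= e^{-t^2/2}.
\end{aligned}
\]
Vice versa, the density function can be retrieved from $\phi_c(t)$ by using the
inversion formula (11.75):
\[
\begin{aligned}
f(x) &= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-t^2/2} e^{-itx} \, dt \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(t^2 + 2itx)} \, dt \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-\frac{1}{2}[t^2 + 2itx + (ix)^2]} e^{(ix)^2/2} \, dt \\
&= \frac{e^{-x^2/2}}{\sqrt{2\pi}} \left[ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}(t + ix)^2} \, dt \right] \\
&= \frac{e^{-x^2/2}}{\sqrt{2\pi}} \left[ \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-u^2/2} \, du \right] \\
&= \frac{e^{-x^2/2}}{\sqrt{2\pi}}.
\end{aligned}
\]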
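
The inversion formula (11.75) also lends itself to a direct numerical check.
The sketch below is a simple illustration and not one of the algorithms
reviewed by Waller, Turnbull, and Hardin (1995): it assumes that the integral
can be truncated to a finite interval $[-t_{\max}, t_{\max}]$ and approximated
by a trapezoidal sum, and it is applied to the characteristic functions of
Examples 11.7.2 and 11.7.4. The function name `invert_cf` and its default
settings are hypothetical.

```python
import cmath
import math

def invert_cf(cf, x, t_max=40.0, steps=8000):
    """Approximate (1/(2 pi)) * integral of cf(t) exp(-i t x) dt over [-t_max, t_max]."""
    h = 2.0 * t_max / steps
    total = 0.0 + 0.0j
    for j in range(steps + 1):
        t = -t_max + j * h
        weight = 0.5 if j in (0, steps) else 1.0   # trapezoidal end-point weights
        total += weight * cf(t) * cmath.exp(-1j * t * x)
    # The imaginary part vanishes up to rounding for a real density.
    return (total * h / (2.0 * math.pi)).real

def normal_cf(t):
    """Characteristic function of the standard normal distribution (Example 11.7.4)."""
    return math.exp(-0.5 * t * t)

def cauchy_cf(t):
    """Characteristic function of the Cauchy distribution (Example 11.7.2)."""
    return math.exp(-abs(t))

x = 1.0
print(invert_cf(normal_cf, x), math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi))
print(invert_cf(cauchy_cf, x), 1.0 / (math.pi * (1.0 + x * x)))
```

Both characteristic functions decay fast enough for the truncation to be
negligible; a slowly decaying characteristic function would call for a more
careful treatment of the tails.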


11.7.4.1. Some Properties of Characteristic Functions 


The book by Lukacs (1970) provides a detailed study of characteristic 
functions and their properties. Proofs of the following theorems can be found 
in Chapter 2 of that book. 


Theorem 11.7.4. Every characteristic function is uniformly continuous on 
the whole real line. 


Theorem 11.7.5. Suppose that $\phi_{1c}(t), \phi_{2c}(t), \ldots, \phi_{nc}(t)$
are characteristic functions. Let $a_1, a_2, \ldots, a_n$ be nonnegative
numbers such that $\sum_{i=1}^{n} a_i = 1$. Then
$\sum_{i=1}^{n} a_i \phi_{ic}(t)$ is also a characteristic function.


Theorem 11.7.6. The characteristic function of the convolution of two 
distribution functions is the product of their characteristic functions. 


Theorem 11.7.7. The product of two characteristic functions is a
characteristic function.


EXERCISES

In Mathematics


11.1. Expand the following functions using Fourier series:
(a) $f(x) = |x|$, $-\pi \le x \le \pi$.
(b) $f(x) = |\sin x|$.
(c) $f(x) = x + x^2$, $-\pi \le x \le \pi$.


11.2. Show that 


o0 1 T? 


Z nat E 
[ Hint: Use the Fourier series for x?.] 
11.3. Let $a_n$ and $b_n$ be the Fourier coefficients for a continuous function
f(x) defined on $[-\pi, \pi]$ such that $f(-\pi) = f(\pi)$, and f'(x) is
piecewise continuous on $[-\pi, \pi]$. Show that
(a) $\lim_{n \to \infty} (n a_n) = 0$,
(b) $\lim_{n \to \infty} (n b_n) = 0$.


11.4. If f(x) is continuous on $[-\pi, \pi]$, $f(-\pi) = f(\pi)$, and f'(x) is
piecewise continuous on $[-\pi, \pi]$, then show that
\[ |f(x) - s_n(x)| \le \frac{c}{\sqrt{n}}, \]
where
\[ s_n(x) = \frac{a_0}{2} + \sum_{k=1}^{n} [a_k \cos kx + b_k \sin kx], \]
and
\[ c^2 = \frac{1}{\pi} \int_{-\pi}^{\pi} f'^2(x) \, dx. \]

11.5. Suppose that f(x) is piecewise continuous on $[-\pi, \pi]$ and has the
Fourier series given in (11.32).
(a) Show that $\sum_{n=1}^{\infty} (-1)^n b_n / n$ is a convergent series.
(b) Show that $\sum_{n=1}^{\infty} b_n / n$ is convergent.
[Hint: Use Theorem 11.3.2.]

11.6. Show that the trigonometric series $\sum_{n=2}^{\infty} (\sin nx)/\log n$
is not a Fourier series of any integrable function.
[Hint: If it were a Fourier series of a function f(x), then $b_n = 1/\log n$
would be the Fourier coefficient of an odd function. Apply now part
(b) of Exercise 11.5 and show that this assumption leads to a contradiction.]

11.7. Consider the Fourier series of f(x) = x given in Example 11.2.1.
(a) Show that
\[ \frac{x^2}{4} = \frac{\pi^2}{12} - \sum_{n=1}^{\infty} \frac{(-1)^{n+1} \cos nx}{n^2} \quad \text{for } -\pi < x < \pi. \]
[Hint: Consider the Fourier series of $\int_0^x f(t) \, dt$.]
(b) Deduce that
\[ \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2} = \frac{\pi^2}{12}. \]

11.8. Make use of the result in Exercise 11.7 to find the sum of the series
$\sum_{n=1}^{\infty} [(-1)^{n+1} \sin nx]/n^3$.

11.9. Show that the Fourier transform of $f(x) = e^{-x^2}$,
$-\infty < x < \infty$, is given by
\[ F(w) = \frac{1}{2\sqrt{\pi}} e^{-w^2/4}. \]
[Hint: Show that $F'(w) + \frac{1}{2} w F(w) = 0$.]

11.10. Prove Theorem 11.6.4 using Fubini's theorem.

11.11. Use the Fourier transform to solve the integral equation
\[ \int_{-\infty}^{\infty} f(x - y) f(y) \, dy = e^{-x^2/2} \]


for the function f(x).

11.12. Consider the function f(x) = x, $-\pi < x < \pi$, with
$f(-\pi) = f(\pi) = 0$, and f(x) $2\pi$-periodic defined on
$(-\infty, \infty)$. The Fourier series of f(x) is
\[ \sum_{n=1}^{\infty} \frac{2(-1)^{n+1} \sin nx}{n}. \]
Let $s_n(x)$ be the nth partial sum of this series, that is,
\[ s_n(x) = \sum_{k=1}^{n} \frac{2(-1)^{k+1}}{k} \sin kx. \]
Let $x_n = \pi - \pi/n$.
(a) Show that
\[ s_n(x_n) = \sum_{k=1}^{n} \frac{2 \sin(k\pi/n)}{k}. \]
(b) Show that
\[ \lim_{n \to \infty} s_n(x_n) = 2 \int_0^{\pi} \frac{\sin x}{x} \, dx \approx 1.18\pi. \]


Note: As $n \to \infty$, $x_n \to \pi^-$. Hence, for n sufficiently large,
\[ s_n(x_n) - f(x_n) \approx 1.18\pi - \pi = 0.18\pi. \]
Thus, near $x = \pi$ [a point of discontinuity for f(x)], the partial
sums of the Fourier series exceed the value of this function by
approximately the amount $0.18\pi \approx 0.565$. This illustrates the
so-called Gibbs phenomenon according to which the Fourier series of
f(x) "overshoots" the value of f(x) in a small neighborhood to
the left of the point of discontinuity of f(x). It can also be shown
that in a small neighborhood to the right of $x = -\pi$, the Fourier
series of f(x) "undershoots" the value of f(x).
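
The overshoot described in this note can be observed numerically. The sketch
below evaluates the partial sum $s_n(x)$ of the series in Exercise 11.12 at
$x_n = \pi - \pi/n$ and reports the ratio $s_n(x_n)/\pi$, which should approach
a value close to 1.18 as n grows. The function names and the choice n = 2000
are assumptions made for illustration.

```python
import math

def partial_sum(n, x):
    """nth partial sum of the Fourier series of f(x) = x on (-pi, pi)."""
    return sum(2.0 * (-1) ** (k + 1) * math.sin(k * x) / k for k in range(1, n + 1))

def overshoot_ratio(n):
    """s_n(x_n) / pi evaluated at x_n = pi - pi / n."""
    x_n = math.pi - math.pi / n
    return partial_sum(n, x_n) / math.pi

def part_a_sum(n):
    """Right-hand side of part (a): sum_{k=1}^n 2 sin(k pi / n) / k."""
    return sum(2.0 * math.sin(k * math.pi / n) / k for k in range(1, n + 1))

for n in (10, 100, 2000):
    print(n, overshoot_ratio(n))
```

The ratio does not tend to 1 as n increases, which is the point of the Gibbs
phenomenon: the overshoot moves closer to the discontinuity but does not
shrink.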


In Statistics 


11.13. In the following table, two observations of the resistance in ohms are
recorded at each of six equally spaced locations on the perimeter of a
new type of solid circular coil (see Kupper, 1972, Table 1):

    φ (radians)    Resistance (ohms)
    0              13.62, 14.40
    π/3            10.552, 10.602
    2π/3           2.196, 3.696
    π              6.39, 7.25
    4π/3           8.854, 10.684
    5π/3           5.408, 8.488

(a) Use the method of least squares to estimate the parameters in the
following trigonometric polynomial of order 2:
\[ \eta = \alpha_0 + \sum_{n=1}^{2} [\alpha_n \cos n\phi + \beta_n \sin n\phi], \]
where $0 \le \phi \le 2\pi$, and $\eta$ denotes the average resistance at
location $\phi$.
(b) Use the prediction equation obtained in part (a) to determine the
points of minimum and maximum resistance on the perimeter of
the circular coil.


11.14. Consider the following circular data set in which $\phi$ is wind direction
and y is temperature (see Anderson-Cook, 2000, Table 1).


φ (radians) y (°F) φ (radians) y (°F)
4.36 52 4.54 38 
3.67 41 2.62 40 
4.36 41 2.97 49 
1.57 31 4.01 48 
3.67 53 4.19 37 
3.67 47 5.59 37 
6.11 43 5.59 33 
5.93 43 3.32 47 
0.52 41 3.67 51 
3.67 46 1.22 42 
3.67 48 4.54 53 
3.32 52 4.19 46 
4.89 43 3.49 51 
3.14 46 4.71 39 


Fit a second-order trigonometric polynomial to this data set, and
verify that the prediction equation is given by
\[ \hat{y} = 41.33 - 2.43 \cos\phi - 2.60 \sin\phi + 3.05 \cos 2\phi + 2.98 \sin 2\phi. \]

11.15. Let $\{Y_n\}_{n=1}^{\infty}$ be a sequence of independent, identically
distributed random variables with mean $\mu$ and variance $\sigma^2$. Let
\[ s_n^* = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}}, \]
where $\bar{Y}_n = (1/n) \sum_{i=1}^{n} Y_i$.
(a) Find the characteristic function of $s_n^*$.
(b) Use Theorem 11.7.3 and part (a) to show that the limiting distribution
of $s_n^*$ as $n \to \infty$ is the standard normal distribution.

Note: Part (b) represents the statement of the well-known central
limit theorem, which asserts that for large n, the arithmetic mean
$\bar{Y}_n$ of a sample of independent, identically distributed random
variables is approximately normally distributed with mean $\mu$ and
standard deviation $\sigma/\sqrt{n}$.


CHAPTER 12 


Approximation of Integrals 


Integration plays an important role in many fields of science and engineering. 
For applications, numerical values of integrals are often required. However, 
in many cases, the evaluation of integrals, or quadrature, by elementary 
functions may not be feasible. Hence, approximating the value of an integral 
in a reliable fashion is a problem of utmost importance. Numerical quadra- 
ture is in fact one of the oldest branches of mathematics: the determination, 
approximately or exactly, of the areas of regions bounded by lines or curves, a 
subject which was studied by the ancient Babylonians (see Haber, 1970). The 
word “quadrature” indicates the process of measuring an area inside a curve 
by finding a square having the same area. Probably no other problem has 
exercised a greater or a longer attraction than that of constructing a square 
equal in area to a given circle. Thousands of people have worked on this 
problem, including the ancient Egyptians as far back as 1800 B.C.

In this chapter, we provide an exposition of methods for approximating 
integrals, including those that are multidimensional. 


12.1. THE TRAPEZOIDAL METHOD 


This is the simplest method of approximating an integral of the form 
$\int_a^b f(x)\,dx$, which represents the area bounded by the curve of the function 
$y = f(x)$ and the two lines $x = a$, $x = b$. The method is based on approximating 
the curve by a series of straight line segments. As a result, the area is 
approximated with a series of trapezoids. For this purpose, the interval from 
$a$ to $b$ is divided into $n$ equal parts by the partition points 
$a = x_0, x_1, x_2, \ldots, x_n = b$. For the $i$th trapezoid, which lies between 
$x_{i-1}$ and $x_i$, its width is $h = (1/n)(b - a)$ and its area is given by 

$$A_i = \frac{h}{2}\left[f(x_{i-1}) + f(x_i)\right], \qquad i = 1, 2, \ldots, n. \tag{12.1}$$
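The sum, $S_n$, of $A_1, A_2, \ldots, A_n$ provides an approximation to the integral 
$\int_a^b f(x)\,dx$: 

$$\begin{aligned}
S_n &= \frac{h}{2}\left\{[f(x_0) + f(x_1)] + [f(x_1) + f(x_2)] + \cdots + [f(x_{n-1}) + f(x_n)]\right\} \\
&= \frac{h}{2}\left[f(x_0) + f(x_n) + 2\sum_{i=1}^{n-1} f(x_i)\right].
\end{aligned} \tag{12.2}$$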
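Formula (12.2) translates directly into a few lines of code. The following is a minimal sketch in Python; the function name `trapezoid` and its argument order are illustrative choices, not notation from the text.

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal sum S_n of formula (12.2) on n equal subintervals."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    h = (b - a) / n
    # Interior partition points x_1, ..., x_{n-1} each appear in two trapezoids.
    interior = sum(f(a + i * h) for i in range(1, n))
    return 0.5 * h * (f(a) + f(b) + 2.0 * interior)


# The rule reproduces the integral of a straight line exactly,
# and approaches 1/3 for f(x) = x^2 on [0, 1] as n grows.
print(trapezoid(lambda x: 2.0 * x + 1.0, 0.0, 3.0, 7))
print(trapezoid(lambda x: x * x, 0.0, 1.0, 100))
```

Since each trapezoid interpolates $f$ linearly, the sum is exact for linear integrands, which gives a convenient sanity check for any implementation.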


12.1.1. Accuracy of the Approximation 


The accuracy in the trapezoidal method depends on the number n of 
trapezoids we take. The next theorem provides information concerning the 
error of approximation. 


Theorem 12.1.1. Suppose that $f(x)$ has a continuous second derivative 
on $[a, b]$, and $|f''(x)| \le M_2$ for all $x$ in $[a, b]$. Then 

$$\left|\int_a^b f(x)\,dx - S_n\right| \le \frac{(b-a)^3 M_2}{12 n^2},$$

where $S_n$ is given by formula (12.2). 

Proof. Consider the partition points $a = x_0, x_1, x_2, \ldots, x_n = b$ such that 
$h = x_i - x_{i-1} = (1/n)(b - a)$, $i = 1, 2, \ldots, n$. The integral of $f(x)$ from 
$x_{i-1}$ to $x_i$ is 

$$I_i = \int_{x_{i-1}}^{x_i} f(x)\,dx. \tag{12.3}$$


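Now, in the trapezoidal method, $f(x)$ is approximated in the interval $[x_{i-1}, x_i]$ 
by the right-hand side of the straight-line equation, 

$$\begin{aligned}
p_i(x) &= f(x_{i-1}) + \frac{1}{h}\left[f(x_i) - f(x_{i-1})\right](x - x_{i-1}) \\
&= \frac{x_i - x}{h}\, f(x_{i-1}) + \frac{x - x_{i-1}}{h}\, f(x_i) \\
&= \frac{x - x_i}{x_{i-1} - x_i}\, f(x_{i-1}) + \frac{x - x_{i-1}}{x_i - x_{i-1}}\, f(x_i), \qquad i = 1, 2, \ldots, n.
\end{aligned}$$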

Note that $p_i(x)$ is a linear Lagrange interpolating polynomial (of degree 
$n = 1$) with $x_{i-1}$ and $x_i$ as its points of interpolation [see formula (9.14)]. 
Using Theorem 9.2.2, the error of interpolation resulting from approximating 
$f(x)$ with $p_i(x)$ over $[x_{i-1}, x_i]$ is given by 

$$f(x) - p_i(x) = \frac{1}{2!}\, f''(\xi_i)(x - x_{i-1})(x - x_i), \qquad i = 1, 2, \ldots, n, \tag{12.4}$$

where $x_{i-1} < \xi_i < x_i$. Formula (12.4) results from applying formula (9.15). 
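Hence, the error of approximating $I_i$ with $A_i$ in (12.1) is 

$$\begin{aligned}
I_i - A_i &= \int_{x_{i-1}}^{x_i} [f(x) - p_i(x)]\,dx \\
&= \frac{1}{2!}\, f''(\xi_i) \int_{x_{i-1}}^{x_i} (x - x_{i-1})(x - x_i)\,dx \\
&= \frac{1}{2}\, f''(\xi_i)\left[\tfrac{1}{3}(x_i^3 - x_{i-1}^3) - \tfrac{1}{2}(x_{i-1} + x_i)(x_i^2 - x_{i-1}^2) + x_{i-1}x_i(x_i - x_{i-1})\right] \\
&= \frac{h}{2}\, f''(\xi_i)\left[\tfrac{1}{3}(x_i^2 + x_{i-1}x_i + x_{i-1}^2) - \tfrac{1}{2}(x_{i-1} + x_i)^2 + x_{i-1}x_i\right] \\
&= \frac{h}{12}\, f''(\xi_i)\left[-(x_i - x_{i-1})^2\right] \\
&= -\frac{h^3}{12}\, f''(\xi_i), \qquad i = 1, 2, \ldots, n.
\end{aligned} \tag{12.5}$$

The total error of approximating $\int_a^b f(x)\,dx$ with $S_n$ is then given by 

$$\int_a^b f(x)\,dx - S_n = -\frac{h^3}{12} \sum_{i=1}^{n} f''(\xi_i).$$

It follows that 

$$\left|\int_a^b f(x)\,dx - S_n\right| \le \frac{n h^3 M_2}{12} = \frac{(b-a)^3 M_2}{12 n^2}. \tag{12.6}$$

□ 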


An alternative procedure to approximating the integral $\int_a^b f(x)\,dx$ by a sum 
of trapezoids is to approximate $\int_{x_{i-1}}^{x_i} f(x)\,dx$ by a trapezoid bounded from 
above by the tangent to the curve of $y = f(x)$ at the point $x_{i-1} + h/2$, which 
is the midpoint of the interval $[x_{i-1}, x_i]$. In this case, the area of the $i$th 
trapezoid is 

$$A_i^* = h f\left(x_{i-1} + \frac{h}{2}\right), \qquad i = 1, 2, \ldots, n.$$

Hence, 

$$\int_a^b f(x)\,dx \approx h \sum_{i=1}^{n} f\left(x_{i-1} + \frac{h}{2}\right), \tag{12.7}$$

and $f(x)$ is approximated in the interval $[x_{i-1}, x_i]$ by 

$$p_i^*(x) = f\left(x_{i-1} + \frac{h}{2}\right) + \left(x - x_{i-1} - \frac{h}{2}\right) f'\left(x_{i-1} + \frac{h}{2}\right).$$


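By applying Taylor’s theorem (Theorem 4.3.1) to $f(x)$ in a neighborhood of 
$x_{i-1} + h/2$, we obtain 

$$\begin{aligned}
f(x) &= f\left(x_{i-1} + \frac{h}{2}\right) + \left(x - x_{i-1} - \frac{h}{2}\right) f'\left(x_{i-1} + \frac{h}{2}\right) + \frac{1}{2!}\left(x - x_{i-1} - \frac{h}{2}\right)^2 f''(\eta_i) \\
&= p_i^*(x) + \frac{1}{2!}\left(x - x_{i-1} - \frac{h}{2}\right)^2 f''(\eta_i),
\end{aligned}$$

where $\eta_i$ lies between $x_{i-1} + h/2$ and $x$. The error of approximating 
$\int_{x_{i-1}}^{x_i} f(x)\,dx$ with $A_i^*$ is then given by 

$$\begin{aligned}
\left|\int_{x_{i-1}}^{x_i} [f(x) - p_i^*(x)]\,dx\right| &= \frac{1}{2!}\left|\int_{x_{i-1}}^{x_i} \left(x - x_{i-1} - \frac{h}{2}\right)^2 f''(\eta_i)\,dx\right| \\
&\le \frac{M_2}{2} \int_{x_{i-1}}^{x_i} \left(x - x_{i-1} - \frac{h}{2}\right)^2 dx \\
&= \frac{h^3 M_2}{24}.
\end{aligned}$$

Consequently, the absolute value of the total error in this case has an upper 
bound of the form 

$$\left|\sum_{i=1}^{n} \int_{x_{i-1}}^{x_i} [f(x) - p_i^*(x)]\,dx\right| \le \frac{n h^3 M_2}{24} = \frac{(b-a)^3 M_2}{24 n^2}. \tag{12.8}$$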
We note that the upper bound in (12.8) is half as large as the one in (12.6). 
This alternative procedure is therefore slightly more precise than the original 
one. Both procedures produce an approximating error that is $O(1/n^2)$. It 
should be noted that this error does not include the roundoff errors in the 
computation of the areas of the approximating trapezoids. 


EXAMPLE 12.1.1. Consider approximating the integral $\int_1^2 dx/x$, which has 
an exact value equal to $\log 2 \approx 0.693147$. Let us divide the interval $[1, 2]$ into 
$n = 10$ subintervals of length $h = \frac{1}{10}$. Hence, $x_0 = 1, x_1 = 1.1, \ldots, x_{10} = 2.0$. 
Using the first trapezoidal method [formula (12.2)], we obtain 

$$\int_1^2 \frac{dx}{x} \approx \frac{1}{20}\left[1 + \frac{1}{2} + 2\sum_{i=1}^{9} \frac{1}{x_i}\right] = 0.69377.$$

Using now the second trapezoidal method [formula (12.7)], we get 

$$\int_1^2 \frac{dx}{x} \approx \frac{1}{10} \sum_{i=1}^{10} \frac{1}{x_{i-1} + 0.05} = 0.69284.$$
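The two values in this example, together with the bounds (12.6) and (12.8), can be checked numerically. The sketch below is one possible arrangement; the helper names `trapezoid` and `midpoint` are hypothetical, and the constant $M_2 = 2$ is our own choice, valid because $|f''(x)| = 2/x^3 \le 2$ on $[1, 2]$.

```python
import math


def trapezoid(f, a, b, n):
    """First trapezoidal method, formula (12.2)."""
    h = (b - a) / n
    return 0.5 * h * (f(a) + f(b) + 2.0 * sum(f(a + i * h) for i in range(1, n)))


def midpoint(f, a, b, n):
    """Second trapezoidal method, formula (12.7): tangent at each midpoint."""
    h = (b - a) / n
    return h * sum(f(a + (i - 0.5) * h) for i in range(1, n + 1))


f = lambda x: 1.0 / x
a, b, n = 1.0, 2.0, 10
M2 = 2.0  # assumed bound: |f''(x)| = 2 / x**3 <= 2 on [1, 2]
exact = math.log(2.0)

s_trap = trapezoid(f, a, b, n)
s_mid = midpoint(f, a, b, n)
bound_trap = (b - a) ** 3 * M2 / (12 * n ** 2)  # right-hand side of (12.6)
bound_mid = (b - a) ** 3 * M2 / (24 * n ** 2)   # right-hand side of (12.8)

print(f"trapezoid: {s_trap:.5f}  error {abs(s_trap - exact):.6f}  bound {bound_trap:.6f}")
print(f"midpoint:  {s_mid:.5f}  error {abs(s_mid - exact):.6f}  bound {bound_mid:.6f}")
```

Both actual errors fall below their respective bounds, and the midpoint error is roughly half the trapezoidal error, in line with the comparison of (12.6) and (12.8).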


12.2. SIMPSON’S METHOD 


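Let us again consider the integral $\int_a^b f(x)\,dx$. Let 
$a = x_0 < x_1 < \cdots < x_{2n-1} < x_{2n} = b$ be a sequence of equally spaced points that 
partition the interval $[a, b]$ such that $x_i - x_{i-1} = h$, $i = 1, 2, \ldots, 2n$. Simpson’s 
method is based on approximating the graph of the function $f(x)$ over the interval 
$[x_{i-1}, x_{i+1}]$ by a parabola which agrees with $f(x)$ at the points $x_{i-1}$, $x_i$, and 
$x_{i+1}$. Thus, over $[x_{i-1}, x_{i+1}]$, $f(x)$ is approximated with a Lagrange interpolating 
polynomial of degree 2 of the form [see formula (9.14)] 

$$\begin{aligned}
q_i(x) &= f(x_{i-1}) \frac{(x - x_i)(x - x_{i+1})}{(x_{i-1} - x_i)(x_{i-1} - x_{i+1})}
 + f(x_i) \frac{(x - x_{i-1})(x - x_{i+1})}{(x_i - x_{i-1})(x_i - x_{i+1})}
 + f(x_{i+1}) \frac{(x - x_{i-1})(x - x_i)}{(x_{i+1} - x_{i-1})(x_{i+1} - x_i)} \\
&= f(x_{i-1}) \frac{(x - x_i)(x - x_{i+1})}{2h^2}
 - f(x_i) \frac{(x - x_{i-1})(x - x_{i+1})}{h^2}
 + f(x_{i+1}) \frac{(x - x_{i-1})(x - x_i)}{2h^2}.
\end{aligned}$$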
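It follows that 

$$\begin{aligned}
\int_{x_{i-1}}^{x_{i+1}} f(x)\,dx &\approx \int_{x_{i-1}}^{x_{i+1}} q_i(x)\,dx \\
&= \frac{f(x_{i-1})}{2h^2} \int_{x_{i-1}}^{x_{i+1}} (x - x_i)(x - x_{i+1})\,dx
 - \frac{f(x_i)}{h^2} \int_{x_{i-1}}^{x_{i+1}} (x - x_{i-1})(x - x_{i+1})\,dx \\
&\quad + \frac{f(x_{i+1})}{2h^2} \int_{x_{i-1}}^{x_{i+1}} (x - x_{i-1})(x - x_i)\,dx \\
&= \frac{f(x_{i-1})}{2h^2}\left(\frac{2h^3}{3}\right) - \frac{f(x_i)}{h^2}\left(-\frac{4h^3}{3}\right) + \frac{f(x_{i+1})}{2h^2}\left(\frac{2h^3}{3}\right) \\
&= \frac{h}{3}\left[f(x_{i-1}) + 4f(x_i) + f(x_{i+1})\right], \qquad i = 1, 3, \ldots, 2n - 1.
\end{aligned} \tag{12.9}$$

By adding up all the approximations in (12.9) for $i = 1, 3, \ldots, 2n - 1$, we 
obtain 

$$\int_a^b f(x)\,dx \approx \frac{h}{3}\left[f(x_0) + 4\sum_{i=1}^{n} f(x_{2i-1}) + 2\sum_{i=1}^{n-1} f(x_{2i}) + f(x_{2n})\right]. \tag{12.10}$$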


As before, in the case of the trapezoidal method, the accuracy of the 
approximation in (12.10) can be figured out by using formula (9.15). Courant 
and John (1965, page 487), however, stated that the error of approximation 
can be improved by one order of magnitude by using a cubic interpolating 
polynomial which agrees with $f(x)$ at $x_{i-1}$, $x_i$, $x_{i+1}$, and whose derivative at 
$x_i$ is equal to $f'(x_i)$. Such a polynomial gives a better approximation to $f(x)$ 
over $[x_{i-1}, x_{i+1}]$ than the quadratic one, and still provides the same approximation 
formula (12.10) for the integral. If $q_i(x)$ is chosen as such, then the 
error of interpolation resulting from approximating $f(x)$ with $q_i(x)$ over 
$[x_{i-1}, x_{i+1}]$ is given by 
$f(x) - q_i(x) = (1/4!)\, f^{(4)}(\xi_i)(x - x_{i-1})(x - x_i)^2(x - x_{i+1})$, 
where $x_{i-1} < \xi_i < x_{i+1}$, provided that $f^{(4)}(x)$ exists and is continuous 
on $[a, b]$. This is equivalent to using formula (9.15) with $n = 3$ and with two 
of the interpolation points coincident at $x_i$. We then have 

$$\left|\int_{x_{i-1}}^{x_{i+1}} [f(x) - q_i(x)]\,dx\right| \le \frac{M_4}{4!} \int_{x_{i-1}}^{x_{i+1}} (x - x_{i-1})(x - x_i)^2 (x_{i+1} - x)\,dx, \qquad i = 1, 3, \ldots, 2n - 1, \tag{12.11}$$
where $M_4$ is an upper bound on $|f^{(4)}(x)|$ for $a \le x \le b$. By computing the 
integral in (12.11) we obtain 

$$\left|\int_{x_{i-1}}^{x_{i+1}} [f(x) - q_i(x)]\,dx\right| \le \frac{M_4 h^5}{90}, \qquad i = 1, 3, \ldots, 2n - 1.$$

Consequently, the total error of approximation in (12.10) is less than or equal 
to 

$$\frac{n M_4 h^5}{90} = \frac{M_4 (b - a)^5}{2880 n^4}, \tag{12.12}$$

since $h = (b - a)/2n$. Thus the error of approximation with Simpson’s method 
is $O(1/n^4)$, where $n$ is half the number of subintervals into which $[a, b]$ is 
divided. Hence, Simpson’s method yields a much more accurate approximation 
than the trapezoidal method. 

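As an example, let us apply Simpson’s method to the calculation of the 
integral in Example 12.1.1 using the same division of $[1, 2]$ into 10 subintervals, 
each of length $h = \frac{1}{10}$. By applying formula (12.10) we obtain 

$$\int_1^2 \frac{dx}{x} \approx \frac{0.10}{3}\left[1 + 4\left(\frac{1}{1.1} + \frac{1}{1.3} + \frac{1}{1.5} + \frac{1}{1.7} + \frac{1}{1.9}\right) + 2\left(\frac{1}{1.2} + \frac{1}{1.4} + \frac{1}{1.6} + \frac{1}{1.8}\right) + \frac{1}{2}\right] = 0.69315.$$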
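Formula (12.10) and the bound (12.12) can be illustrated with a short program. This is a sketch only: the function name `simpson` is our own, and the constant $M_4 = 24$ is an assumption that holds here because $|f^{(4)}(x)| = 24/x^5 \le 24$ on $[1, 2]$.

```python
import math


def simpson(f, a, b, n):
    """Composite Simpson sum of formula (12.10); [a, b] is split into 2n subintervals."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    h = (b - a) / (2 * n)
    odd = sum(f(a + (2 * i - 1) * h) for i in range(1, n + 1))   # x_1, x_3, ..., x_{2n-1}
    even = sum(f(a + 2 * i * h) for i in range(1, n))            # x_2, x_4, ..., x_{2n-2}
    return (h / 3.0) * (f(a) + 4.0 * odd + 2.0 * even + f(b))


approx = simpson(lambda x: 1.0 / x, 1.0, 2.0, 5)   # 10 subintervals, h = 0.1
M4 = 24.0                                          # assumed bound on |f''''| over [1, 2]
bound = M4 * (2.0 - 1.0) ** 5 / (2880 * 5 ** 4)    # right-hand side of (12.12)
print(f"Simpson: {approx:.5f}  error {abs(approx - math.log(2.0)):.2e}  bound {bound:.2e}")
```

With the same ten subintervals as in Example 12.1.1, the error is two orders of magnitude smaller than that of either trapezoidal method.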


12.3. NEWTON-COTES METHODS 


The trapezoidal and Simpson’s methods are two special cases of a general 
series of approximate integration methods of the so-called Newton—Cotes 
type. In the trapezoidal method, straight line segments were used to approxi- 
mate the graph of the function f(x) between a and b. In Simpson’s method, 
the approximation was carried out using a series of parabolas. We can refine 
this approximation even further by considering a series of cubic curves, 
quartic curves, and so on. For cubic approximations, four equally spaced 
points are used to subdivide each subinterval of [a, b] (instead of two points 
for the trapezoidal method and three points for Simpson’s method), whereas 
five points are needed for quartic approximation, and so on. All such 
approximations are of the Newton—Cotes type. 
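The coefficients of any rule of this type can be obtained by requiring exactness for the monomials $1, x, \ldots, x^m$ on $m + 1$ equally spaced points. The sketch below solves the resulting linear system in exact rational arithmetic. The function name `newton_cotes_weights` and the convention of placing the points at $0, 1, \ldots, m$ (unit spacing) are illustrative assumptions; for spacing $h$ the weights are multiplied by $h$.

```python
from fractions import Fraction


def newton_cotes_weights(m):
    """Weights w_0..w_m of the closed Newton-Cotes rule on the points 0, 1, ..., m.

    The weights solve sum_k w_k * k**j = m**(j + 1) / (j + 1), j = 0, ..., m,
    so that the rule integrates 1, x, ..., x**m exactly over [0, m].
    """
    size = m + 1
    # Augmented matrix of the moment equations, kept as exact fractions.
    rows = [[Fraction(k) ** j for k in range(size)] + [Fraction(m ** (j + 1), j + 1)]
            for j in range(size)]
    # Gauss-Jordan elimination.
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        lead = rows[col][col]
        rows[col] = [v / lead for v in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [vr - factor * vc for vr, vc in zip(rows[r], rows[col])]
    return [rows[r][size] for r in range(size)]


for m in (1, 2, 3, 4):
    print(m, [str(w) for w in newton_cotes_weights(m)])
```

For $m = 1$ and $m = 2$ the weights are those of the trapezoidal rule, $\frac{h}{2}(1, 1)$, and of Simpson's rule, $\frac{h}{3}(1, 4, 1)$; $m = 3$ gives the cubic case described above.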


12.4. GAUSSIAN QUADRATURE 


All Newton—Cotes methods require the use of equally spaced points, as was 
seen in the cases of the trapezoidal method and Simpson’s method. If this 
requirement is waived, then it is possible to select the points in a manner that 
reduces the approximation error. 

Let $x_0 < x_1 < x_2 < \cdots < x_n$ be $n + 1$ distinct points in $[a, b]$. Consider the 
approximation 

$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{n} \omega_i f(x_i), \tag{12.13}$$

where the coefficients, $\omega_0, \omega_1, \ldots, \omega_n$, are to be determined along with the 
points $x_0, x_1, \ldots, x_n$. The total number of unknown quantities in (12.13) is 
$2n + 2$. Hence, $2n + 2$ conditions must be specified. According to the so-called 
Gaussian integration rule, the $\omega_i$’s and $x_i$’s are chosen such that the approximation 
in (12.13) will be exact for all polynomials of degrees not exceeding 
$2n + 1$. This is equivalent to requiring that the approximation (12.13) be 
exact for $f(x) = x^j$, $j = 0, 1, 2, \ldots, 2n + 1$, that is, 

$$\int_a^b x^j\,dx = \sum_{i=0}^{n} \omega_i x_i^j, \qquad j = 0, 1, 2, \ldots, 2n + 1. \tag{12.14}$$

This process produces $2n + 2$ equations to be solved for the $\omega_i$’s and $x_i$’s. In 
particular, if the limits of integration are $a = -1$, $b = 1$, then it can be shown 
(see Phillips and Taylor, 1973, page 140) that the $x_i$-values will be the $n + 1$ 
zeros of the Legendre polynomial $p_{n+1}(x)$ of degree $n + 1$ (see Section 10.2). 
The $\omega_i$-values can be easily found by solving the system of equations (12.14), 
which is linear in the $\omega_i$’s. 

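For example, for $n = 1$, the zeros of $p_2(x) = \frac{1}{2}(3x^2 - 1)$ are $x_0 = -1/\sqrt{3}$, 
$x_1 = 1/\sqrt{3}$. Applying (12.14), we obtain 

$$\begin{aligned}
\int_{-1}^{1} dx &= \omega_0 + \omega_1 &&\Rightarrow\quad \omega_0 + \omega_1 = 2, \\
\int_{-1}^{1} x\,dx &= \omega_0 x_0 + \omega_1 x_1 &&\Rightarrow\quad \frac{1}{\sqrt{3}}(-\omega_0 + \omega_1) = 0, \\
\int_{-1}^{1} x^2\,dx &= \omega_0 x_0^2 + \omega_1 x_1^2 &&\Rightarrow\quad \frac{1}{3}(\omega_0 + \omega_1) = \frac{2}{3}, \\
\int_{-1}^{1} x^3\,dx &= \omega_0 x_0^3 + \omega_1 x_1^3 &&\Rightarrow\quad \frac{1}{3\sqrt{3}}(-\omega_0 + \omega_1) = 0.
\end{aligned}$$

We note that the last two equations are identical to the first two. Solving the 
latter for $\omega_0$ and $\omega_1$, we get $\omega_0 = \omega_1 = 1$. Hence, we have the approximation 

$$\int_{-1}^{1} f(x)\,dx \approx f\left(-\frac{1}{\sqrt{3}}\right) + f\left(\frac{1}{\sqrt{3}}\right),$$

which is exact if $f(x)$ is a polynomial of degree not exceeding $2n + 1 = 3$. 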

If the limits of integration are not equal to $-1, 1$, we can easily convert the 
integral $\int_a^b f(x)\,dx$ to one with the limits $-1, 1$ by making the change of 
variable 

$$z = \frac{2x - (a + b)}{b - a}.$$

This converts the general integral $\int_a^b f(x)\,dx$ to the integral 
$[(b - a)/2]\int_{-1}^{1} g(z)\,dz$, where 

$$g(z) = f\left[\frac{(b - a)z + b + a}{2}\right].$$

We therefore have the approximation 

$$\int_a^b f(x)\,dx \approx \frac{b - a}{2} \sum_{i=0}^{n} \omega_i g(z_i), \tag{12.15}$$

where the $z_i$’s are the zeros of the Legendre polynomial $p_{n+1}(z)$. 
It can be shown (see Davis and Rabinowitz, 1975, page 75) that when 
$a = -1$, $b = 1$, the error of approximation in (12.13) is given by 

$$\int_a^b f(x)\,dx - \sum_{i=0}^{n} \omega_i f(x_i) = \frac{(b-a)^{2n+3}[(n+1)!]^4}{(2n+3)[(2n+2)!]^3}\, f^{(2n+2)}(\xi), \qquad a < \xi < b, \tag{12.16}$$

provided that $f^{(2n+2)}(x)$ is continuous on $[a, b]$. This error decreases rapidly 
as $n$ increases. Thus this Gaussian quadrature provides a very good approximation 
with a formula of the type given in (12.13). 

There are several extensions of the approximation in (12.13). These 
extensions are of the form 

$$\int_a^b \lambda(x) f(x)\,dx \approx \sum_{i=0}^{n} \omega_i f(x_i), \tag{12.17}$$

where $\lambda(x)$ is a particular positive weight function. As before, the coefficients 
$\omega_0, \omega_1, \ldots, \omega_n$ and the points $x_0, x_1, \ldots, x_n$, which belong to $[a, b]$, are 
chosen so that (12.17) is exact for all polynomials of degrees not exceeding 
$2n + 1$. The choice of the $x_i$’s depends on the form of $\lambda(x)$. It can be shown 
that the values of $x_i$ are the zeros of a polynomial of degree $n + 1$ belonging 
to the sequence of polynomials that are orthogonal on $[a, b]$ with respect to 
$\lambda(x)$ (see Davis and Rabinowitz, 1975, page 74; Phillips and Taylor, 1973, 
page 142). For example, if $a = -1$, $b = 1$, $\lambda(x) = (1 - x)^{\alpha}(1 + x)^{\beta}$, $\alpha > -1$, 
$\beta > -1$, then the $x_i$’s are the zeros of the Jacobi polynomial $p_{n+1}^{(\alpha, \beta)}(x)$ (see 
Section 10.3). Also, if $a = -1$, $b = 1$, $\lambda(x) = (1 - x^2)^{-1/2}$, then the $x_i$’s are 
the zeros of the Chebyshev polynomial of the first kind, $T_{n+1}(x)$ (see Section 
10.4), and so on. For those two cases, formula (12.17) is called the 
Gauss-Jacobi quadrature formula and the Gauss-Chebyshev quadrature formula, 
respectively. The choice $\lambda(x) = 1$ results in the original formula (12.13), 
which is now referred to as the Gauss-Legendre quadrature formula. 


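EXAMPLE 12.4.1. Consider the integral $\int_0^1 dx/(1 + x)$, which has the exact 
value $\log 2 = 0.69314718$. Applying formula (12.15), we get 

$$\int_0^1 \frac{dx}{1 + x} \approx \frac{1}{2}\sum_{i=0}^{n} \frac{\omega_i}{1 + \frac{1}{2}(z_i + 1)} = \sum_{i=0}^{n} \frac{\omega_i}{3 + z_i}, \qquad -1 < z_i < 1, \tag{12.18}$$

where the $z_i$’s are the zeros of the Legendre polynomial $p_{n+1}(z)$, $z = 2x - 1$. 

Let $n = 1$; then $p_2(z) = \frac{1}{2}(3z^2 - 1)$ with zeros equal to $z_0 = -1/\sqrt{3}$, 
$z_1 = 1/\sqrt{3}$. We have seen earlier that $\omega_0 = 1$, $\omega_1 = 1$; hence, from (12.18), we 
obtain 

$$\int_0^1 \frac{dx}{1 + x} \approx \frac{\sqrt{3}}{3\sqrt{3} - 1} + \frac{\sqrt{3}}{3\sqrt{3} + 1} = 0.692307691.$$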


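Let us now use $n = 2$ in (12.18). Then $p_3(z) = \frac{1}{2}(5z^3 - 3z)$ (see Section 
10.2). Its zeros are $z_0 = -(\frac{3}{5})^{1/2}$, $z_1 = 0$, $z_2 = (\frac{3}{5})^{1/2}$. To find the $\omega_i$’s, we 
apply formula (12.14) using $a = -1$, $b = 1$, and $z$ in place of $x$. For 
$j = 0, 1, 2, 3, 4, 5$, we have 

$$\int_{-1}^{1} z^j\,dz = \omega_0 z_0^j + \omega_1 z_1^j + \omega_2 z_2^j, \qquad j = 0, 1, \ldots, 5.$$

These equations can be written as 

$$\begin{aligned}
\omega_0 + \omega_1 + \omega_2 &= 2, \\
(\tfrac{3}{5})^{1/2}(-\omega_0 + \omega_2) &= 0, \\
\tfrac{3}{5}(\omega_0 + \omega_2) &= \tfrac{2}{3}, \\
(\tfrac{3}{5})^{3/2}(-\omega_0 + \omega_2) &= 0, \\
\tfrac{9}{25}(\omega_0 + \omega_2) &= \tfrac{2}{5}, \\
(\tfrac{3}{5})^{5/2}(-\omega_0 + \omega_2) &= 0.
\end{aligned}$$

The above six equations can be reduced to only three that are linearly 
independent, namely, 

$$\begin{aligned}
\omega_0 + \omega_1 + \omega_2 &= 2, \\
-\omega_0 + \omega_2 &= 0, \\
\omega_0 + \omega_2 &= \tfrac{10}{9},
\end{aligned}$$

the solution of which is $\omega_0 = \frac{5}{9}$, $\omega_1 = \frac{8}{9}$, $\omega_2 = \frac{5}{9}$. Substituting the $\omega_i$’s and $z_i$’s 
in (12.18), we obtain 

$$\begin{aligned}
\int_0^1 \frac{dx}{1 + x} &\approx \frac{\omega_0}{3 + z_0} + \frac{\omega_1}{3 + z_1} + \frac{\omega_2}{3 + z_2} \\
&= \frac{5/9}{3 - (\frac{3}{5})^{1/2}} + \frac{8/9}{3} + \frac{5/9}{3 + (\frac{3}{5})^{1/2}} \\
&= 0.693121685.
\end{aligned}$$

For higher values of $n$, the zeros of Legendre polynomials and the 
corresponding values of $\omega_i$ can be found, for example, in Shoup (1984, Table 
7.5) and Krylov (1962, Appendix A). 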
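The two- and three-point rules derived in this section, combined with the change of variable in (12.15), can be sketched as follows. The dictionary `RULES` and the function name `gauss_legendre` are illustrative assumptions; the nodes and weights are the ones obtained above for $n = 1$ and $n = 2$.

```python
import math

# Zeros of the Legendre polynomials p_2 and p_3 with the weights derived in the text,
# keyed by n (the rule uses n + 1 points).
RULES = {
    1: ((-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0)), (1.0, 1.0)),
    2: ((-math.sqrt(0.6), 0.0, math.sqrt(0.6)), (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)),
}


def gauss_legendre(f, a, b, n):
    """Approximate the integral of f over [a, b] by formula (12.15) with n + 1 points."""
    nodes, weights = RULES[n]
    half, mid = 0.5 * (b - a), 0.5 * (b + a)
    # x = half * z + mid maps z in [-1, 1] to x in [a, b].
    return half * sum(w * f(half * z + mid) for z, w in zip(nodes, weights))


f = lambda x: 1.0 / (1.0 + x)
for n in (1, 2):
    approx = gauss_legendre(f, 0.0, 1.0, n)
    print(f"n = {n}: {approx:.9f}  error {abs(approx - math.log(2.0)):.2e}")
```

In exact arithmetic the two approximations are the fractions $9/13$ and $131/189$, which agree with the values in Example 12.4.1 to about eight decimal places.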


12.5. APPROXIMATION OVER AN INFINITE INTERVAL 


Consider an integral of the form $\int_a^{\infty} f(x)\,dx$, which is improper of the first 
kind (see Section 6.5). It can be approximated by using the integral $\int_a^b f(x)\,dx$ 
for a sufficiently large value of $b$, provided, of course, that $\int_a^{\infty} f(x)\,dx$ is 
convergent. The methods discussed earlier in Sections 12.1-12.4 can then be 
applied to $\int_a^b f(x)\,dx$. 

For improper integrals of the first kind, of the form $\int_0^{\infty} \lambda(x) f(x)\,dx$, 
$\int_{-\infty}^{\infty} \lambda(x) f(x)\,dx$, we have the following Gaussian approximations: 

$$\int_0^{\infty} \lambda(x) f(x)\,dx \approx \sum_{i=0}^{n} \omega_i f(x_i), \tag{12.19}$$

$$\int_{-\infty}^{\infty} \lambda(x) f(x)\,dx \approx \sum_{i=0}^{n} \omega_i f(x_i), \tag{12.20}$$

where, as before, the $x_i$’s and $\omega_i$’s are chosen so that (12.19) and (12.20) are 
exact for all polynomials of degrees not exceeding $2n + 1$. For the weight 
function $\lambda(x)$ in (12.19), the choice $\lambda(x) = e^{-x}$ gives the Gauss-Laguerre 
quadrature, for which the $x_i$’s are the zeros of the Laguerre polynomial 
$L_{n+1}(x)$ of degree $n + 1$ and $\alpha = 0$ (see Section 10.6). The associated error of 
approximation is given by (see Davis and Rabinowitz, 1975, page 173) 

$$\int_0^{\infty} e^{-x} f(x)\,dx - \sum_{i=0}^{n} \omega_i f(x_i) = \frac{[(n+1)!]^2}{(2n+2)!}\, f^{(2n+2)}(\xi), \qquad 0 < \xi < \infty.$$

Choosing $\lambda(x) = e^{-x^2}$ in (12.20) gives the Gauss-Hermite quadrature, and the 
corresponding $x_i$’s are the zeros of the Hermite polynomial $H_{n+1}(x)$ of 
degree $n + 1$ (see Section 10.5). The associated error of approximation is of 
the form (see Davis and Rabinowitz, 1975, page 174) 

$$\int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx - \sum_{i=0}^{n} \omega_i f(x_i) = \frac{(n+1)!\sqrt{\pi}}{2^{n+1}(2n+2)!}\, f^{(2n+2)}(\xi), \qquad -\infty < \xi < \infty.$$
We can also use the Gauss-Laguerre and the Gauss-Hermite quadrature 
formulas to approximate convergent integrals of the form $\int_0^{\infty} f(x)\,dx$, 
$\int_{-\infty}^{\infty} f(x)\,dx$: 

$$\int_0^{\infty} f(x)\,dx = \int_0^{\infty} e^{-x} e^{x} f(x)\,dx \approx \sum_{i=0}^{n} \omega_i e^{x_i} f(x_i),$$

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty} e^{-x^2} e^{x^2} f(x)\,dx \approx \sum_{i=0}^{n} \omega_i e^{x_i^2} f(x_i).$$

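EXAMPLE 12.5.1. Consider the integral 

$$I = \int_0^{\infty} \frac{e^{-x} x}{1 - e^{-2x}}\,dx \approx \sum_{i=0}^{n} \omega_i f(x_i), \tag{12.21}$$

where $f(x) = x(1 - e^{-2x})^{-1}$ and the $x_i$’s are the zeros of the Laguerre 
polynomial $L_{n+1}(x)$. To find expressions for $L_n(x)$, $n = 0, 1, 2, \ldots$, we can use 
the recurrence relation (10.37) with $\alpha = 0$, which can be written as 

$$L_{n+1}(x) = (x - n - 1) L_n(x) - x\,\frac{dL_n(x)}{dx}, \qquad n = 0, 1, 2, \ldots. \tag{12.22}$$

Recall that $L_0(x) = 1$. Choosing $n = 1$ in (12.21), we get 

$$I \approx \omega_0 f(x_0) + \omega_1 f(x_1). \tag{12.23}$$

From (12.22) we have 

$$\begin{aligned}
L_1(x) &= (x - 1) L_0(x) \\
&= x - 1, \\
L_2(x) &= (x - 2) L_1(x) - x\,\frac{d(x - 1)}{dx} \\
&= (x - 2)(x - 1) - x \\
&= x^2 - 4x + 2.
\end{aligned}$$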

The zeros of $L_2(x)$ are $x_0 = 2 - \sqrt{2}$, $x_1 = 2 + \sqrt{2}$. To find $\omega_0$ and $\omega_1$, 
formula (12.19) must be exact for all polynomials of degrees not exceeding 
$2n + 1 = 3$. This is equivalent to requiring that 

$$\begin{aligned}
\int_0^{\infty} e^{-x}\,dx &= \omega_0 + \omega_1, \\
\int_0^{\infty} e^{-x} x\,dx &= \omega_0 x_0 + \omega_1 x_1, \\
\int_0^{\infty} e^{-x} x^2\,dx &= \omega_0 x_0^2 + \omega_1 x_1^2, \\
\int_0^{\infty} e^{-x} x^3\,dx &= \omega_0 x_0^3 + \omega_1 x_1^3.
\end{aligned}$$

Only two equations are linearly independent, the solution of which is $\omega_0 = 
0.853553$, $\omega_1 = 0.146447$. From (12.23) we then have 

$$I \approx \frac{\omega_0 x_0}{1 - e^{-2x_0}} + \frac{\omega_1 x_1}{1 - e^{-2x_1}} = 1.225054.$$

Let us now calculate (12.21) using $n = 2, 3, 4$. The zeros of Laguerre 
polynomials $L_3(x)$, $L_4(x)$, $L_5(x)$, and the values of $\omega_i$ for each $n$ are shown 
in Table 12.1. These values are given in Ralston and Rabinowitz (1978, page 
106) and also in Krylov (1962, Appendix C). The corresponding approximate 
values of $I$ are given in Table 12.1. It can be shown that the exact value of $I$ 
is $\pi^2/8 \approx 1.2337$. 


Table 12.1. Zeros of Laguerre Polynomials ($x_i$), Values of $\omega_i$, 
and the Corresponding Approximate Values^a of $I$ 

n    $x_i$        $\omega_i$   $I$
1     0.585786    0.853553
      3.414214    0.146447    1.225054
2     0.415775    0.711093
      2.294280    0.278518
      6.289945    0.010389    1.234538
3     0.322548    0.603154
      1.745761    0.357419
      4.536620    0.038888
      9.395071    0.000539    1.234309
4     0.263560    0.521756
      1.413403    0.398667
      3.596426    0.075942
      7.085810    0.003612
     12.640801    0.000023    1.233793

^a See (12.21). 
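The first two rows of Table 12.1 can be reproduced with a short calculation. In the sketch below, the $n = 1$ weights are obtained from the first two moment equations, $\omega_0 + \omega_1 = 1$ and $\omega_0 x_0 + \omega_1 x_1 = 1$, while the $n = 2$ nodes and weights are simply copied from the table; the names `f` and `gauss_laguerre` are our own.

```python
import math


def f(x):
    """f(x) = x / (1 - exp(-2x)), the factor multiplying exp(-x) in (12.21)."""
    return x / (1.0 - math.exp(-2.0 * x))


def gauss_laguerre(nodes, weights):
    """Right-hand side of (12.21) for given nodes x_i and weights w_i."""
    return sum(w * f(x) for x, w in zip(nodes, weights))


# n = 1: zeros of L_2(x) = x^2 - 4x + 2.
x0, x1 = 2.0 - math.sqrt(2.0), 2.0 + math.sqrt(2.0)
# Solve w0 + w1 = 1 and w0*x0 + w1*x1 = 1 for the two weights.
w1 = (1.0 - x0) / (x1 - x0)
w0 = 1.0 - w1
approx_n1 = gauss_laguerre([x0, x1], [w0, w1])

# n = 2: nodes and weights copied from Table 12.1.
approx_n2 = gauss_laguerre([0.415775, 2.294280, 6.289945],
                           [0.711093, 0.278518, 0.010389])

exact = math.pi ** 2 / 8.0
print(f"w0 = {w0:.6f}, w1 = {w1:.6f}")
print(f"n = 1: {approx_n1:.6f}   n = 2: {approx_n2:.6f}   exact: {exact:.6f}")
```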
12.6. THE METHOD OF LAPLACE 


This method is used to approximate integrals of the form 

$$I(\lambda) = \int_a^b \varphi(x) e^{\lambda h(x)}\,dx, \tag{12.24}$$

where $\lambda$ is a large positive constant, $\varphi(x)$ is continuous on $[a, b]$, and the first 
and second derivatives of $h(x)$ are continuous on $[a, b]$. The limits $a$ and $b$ 
may be finite or infinite. This integral was used by Laplace in his original 
development of the central limit theorem (see Section 4.5.1). More specifically, 
if $X_1, X_2, \ldots, X_n, \ldots$ is a sequence of independent and identically 
distributed random variables with a common density function, then the 
density function of the sum $S_n = \sum_{i=1}^{n} X_i$ can be represented in the form 
(12.24) (see Wong, 1989, Chapter 2). 

Suppose that $h(x)$ has a single maximum in the interval $[a, b]$ at $x = t$, 
$a \le t \le b$, where $h'(t) = 0$ and $h''(t) < 0$. Hence, $e^{\lambda h(x)}$ is maximized at $t$ for 
any $\lambda > 0$. Suppose further that $e^{\lambda h(x)}$ becomes very strongly peaked at $x = t$ 
and decreases rapidly away from $x = t$ on $[a, b]$ as $\lambda \to \infty$. In this case, the 
major portion of $I(\lambda)$ comes from integrating the function $\varphi(x) e^{\lambda h(x)}$ over a 
small neighborhood around $x = t$. Under these conditions, it can be shown 
that if $a < t < b$, and as $\lambda \to \infty$, 

$$I(\lambda) \sim \varphi(t) e^{\lambda h(t)} \left[\frac{-2\pi}{\lambda h''(t)}\right]^{1/2}, \tag{12.25}$$

where $\sim$ denotes asymptotic equality (see Section 3.3). Formula (12.25) is 
known as Laplace’s approximation. 

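A heuristic derivation of (12.25) can be arrived at by replacing $\varphi(x)$ and 
$h(x)$ by the leading terms in their Taylor’s series expansions around $x = t$. 
The integration limits are then extended to $-\infty$ and $\infty$, that is, 

$$\begin{aligned}
\int_a^b \varphi(x) e^{\lambda h(x)}\,dx &\approx \int_a^b \varphi(t) \exp\left\{\lambda\left[h(t) + \tfrac{1}{2}(x - t)^2 h''(t)\right]\right\} dx \\
&\approx \int_{-\infty}^{\infty} \varphi(t) \exp\left\{\lambda\left[h(t) + \tfrac{1}{2}(x - t)^2 h''(t)\right]\right\} dx \\
&= \varphi(t) e^{\lambda h(t)} \int_{-\infty}^{\infty} \exp\left[\frac{\lambda}{2}(x - t)^2 h''(t)\right] dx \qquad (12.26) \\
&= \varphi(t) e^{\lambda h(t)} \left[\frac{-2\pi}{\lambda h''(t)}\right]^{1/2}. \qquad (12.27)
\end{aligned}$$

Formula (12.27) follows from (12.26) by making use of the fact that 
$\int_0^{\infty} e^{-x^2}\,dx = \frac{1}{2}\Gamma(\frac{1}{2}) = \sqrt{\pi}/2$, where $\Gamma(\cdot)$ is the gamma function (see Example 6.9.6), or 
by simply evaluating the integral of a normal density function. 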
If $t = a$, then it can be shown that as $\lambda \to \infty$, 

$$I(\lambda) \sim \varphi(a) e^{\lambda h(a)} \left[\frac{-\pi}{2\lambda h''(a)}\right]^{1/2}. \tag{12.28}$$

Rigorous proofs of (12.27) and (12.28) can be found in Wong (1989, 
Chapter 2), Copson (1965, Chapter 5), Fulks (1978, Chapter 18), and Lange 
(1999, Chapter 4). 


EXAMPLE 12.6.1. Consider the gamma function 

$$\Gamma(n + 1) = \int_0^{\infty} e^{-x} x^n\,dx, \qquad n > -1. \tag{12.29}$$

Let us find an approximation for $\Gamma(n + 1)$ when $n$ is large and positive, but 
not necessarily an integer. Let $x = nz$; then (12.29) can be written as 

$$\begin{aligned}
\Gamma(n + 1) &= n \int_0^{\infty} e^{-nz} \exp[n \log(nz)]\,dz \\
&= n \int_0^{\infty} e^{-nz} \exp[n \log n + n \log z]\,dz \\
&= n^{n+1} \int_0^{\infty} \exp[n(-z + \log z)]\,dz.
\end{aligned} \tag{12.30}$$

Let $h(z) = -z + \log z$. Then $h(z)$ has a unique maximum at $z = 1$ with 
$h'(1) = 0$ and $h''(1) = -1$. Applying formula (12.27) to (12.30), we obtain 

$$\Gamma(n + 1) \sim n^{n+1} e^{-n} \left[\frac{-2\pi}{n(-1)}\right]^{1/2} = e^{-n} n^n \sqrt{2\pi n}, \tag{12.31}$$

as $n \to \infty$. Formula (12.31) is known as Stirling’s formula. 
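The quality of Stirling's formula for moderate $n$ is easy to examine numerically. The following sketch compares (12.31) with the gamma function from Python's standard library; the function name `stirling` and the particular values of $n$ are arbitrary illustrative choices.

```python
import math


def stirling(n):
    """Right-hand side of (12.31): exp(-n) * n**n * sqrt(2*pi*n)."""
    return math.exp(-n) * n ** n * math.sqrt(2.0 * math.pi * n)


def ratio(n):
    """Stirling's approximation divided by Gamma(n + 1); tends to 1 as n grows."""
    return stirling(n) / math.gamma(n + 1.0)


for n in (5.0, 20.0, 50.5, 100.0):
    print(f"n = {n:6.1f}   ratio = {ratio(n):.6f}")
```

The ratio stays below 1 and approaches it as $n$ increases, consistent with the asymptotic equality in (12.31). Note that $n$ need not be an integer.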


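EXAMPLE 12.6.2. Consider the integral 

$$I_n(\lambda) = \frac{1}{\pi} \int_0^{\pi} \exp(\lambda \cos x) \cos nx\,dx$$

as $\lambda \to \infty$. This integral looks like (12.24) with $h(x) = \cos x$, which has a 
single maximum at $x = 0$ in $[0, \pi]$. Since $h''(0) = -1$, then by applying (12.28) 
with $\varphi(x) = (1/\pi) \cos nx$ we obtain, as $\lambda \to \infty$, 

$$I_n(\lambda) \sim \frac{1}{\pi}\, e^{\lambda} \left[\frac{-\pi}{2\lambda(-1)}\right]^{1/2} = \frac{e^{\lambda}}{\sqrt{2\pi\lambda}}.$$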

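EXAMPLE 12.6.3. Consider the integral 

$$I(\lambda) = \int_0^{\pi} \exp[\lambda \sin x]\,dx$$

as $\lambda \to \infty$. Here, $h(x) = \sin x$, which has a single maximum at $x = \pi/2$ in 
$[0, \pi]$, and $h''(\pi/2) = -1$. From (12.27), we get 

$$I(\lambda) \sim e^{\lambda} \left[\frac{-2\pi}{\lambda(-1)}\right]^{1/2} = e^{\lambda} \sqrt{\frac{2\pi}{\lambda}}$$

as $\lambda \to \infty$. 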
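Laplace's approximation in this last example can be compared with a direct numerical evaluation of the integral. The sketch below uses a composite Simpson rule as the reference value; the function names, the number of subintervals, and the chosen values of $\lambda$ are all illustrative assumptions.

```python
import math


def laplace_interior(phi_t, h_t, h2_t, lam):
    """Laplace's approximation (12.25) for a maximum at an interior point t.

    phi_t, h_t, h2_t are the values of phi(t), h(t), and h''(t).
    """
    return phi_t * math.exp(lam * h_t) * math.sqrt(-2.0 * math.pi / (lam * h2_t))


def simpson(f, a, b, n):
    """Composite Simpson rule (12.10) on 2n subintervals, used as a reference."""
    h = (b - a) / (2 * n)
    odd = sum(f(a + (2 * i - 1) * h) for i in range(1, n + 1))
    even = sum(f(a + 2 * i * h) for i in range(1, n))
    return (h / 3.0) * (f(a) + 4.0 * odd + 2.0 * even + f(b))


def ratio(lam):
    """Laplace approximation of Example 12.6.3 divided by the numerical integral."""
    numeric = simpson(lambda x: math.exp(lam * math.sin(x)), 0.0, math.pi, 2000)
    return laplace_interior(1.0, 1.0, -1.0, lam) / numeric


for lam in (5.0, 20.0, 80.0):
    print(f"lambda = {lam:5.1f}   ratio = {ratio(lam):.5f}")
```

The ratio moves toward 1 as $\lambda$ grows, which is what asymptotic equality means; for small $\lambda$ the approximation is noticeably less accurate.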


12.7. MULTIPLE INTEGRALS 


We recall that integration of a multivariable function was discussed in 
Section 7.9. In the present section, we consider approximate integration 
formulas for an $n$-tuple Riemann integral over a region $D$ in an $n$-dimensional 
Euclidean space $R^n$. 

For example, let us consider the double integral $I = \iint_D f(x_1, x_2)\,dx_1\,dx_2$, 
where $D \subset R^2$ is the region 

$$D = \{(x_1, x_2) \mid a \le x_1 \le b,\ \psi(x_1) \le x_2 \le \varphi(x_1)\}.$$

Then 

$$I = \int_a^b \left[\int_{\psi(x_1)}^{\varphi(x_1)} f(x_1, x_2)\,dx_2\right] dx_1 = \int_a^b g(x_1)\,dx_1, \tag{12.32}$$

where 

$$g(x_1) = \int_{\psi(x_1)}^{\varphi(x_1)} f(x_1, x_2)\,dx_2. \tag{12.33}$$
Let us now apply a Gaussian integration rule to (12.32) using the points 
$x_1 = z_0, z_1, \ldots, z_m$ with the matching coefficients $\omega_0, \omega_1, \ldots, \omega_m$, 

$$\int_a^b g(x_1)\,dx_1 \approx \sum_{i=0}^{m} \omega_i g(z_i).$$

Thus 

$$I \approx \sum_{i=0}^{m} \omega_i \int_{\psi(z_i)}^{\varphi(z_i)} f(z_i, x_2)\,dx_2. \tag{12.34}$$

For the $i$th of the $m + 1$ integrals in (12.34) we can apply a Gaussian 
integration rule using the points $y_{i0}, y_{i1}, \ldots, y_{in}$ with the corresponding 
coefficients $v_{i0}, v_{i1}, \ldots, v_{in}$. We then have 

$$\int_{\psi(z_i)}^{\varphi(z_i)} f(z_i, x_2)\,dx_2 \approx \sum_{j=0}^{n} v_{ij} f(z_i, y_{ij}).$$

Hence, 

$$I \approx \sum_{i=0}^{m} \sum_{j=0}^{n} \omega_i v_{ij} f(z_i, y_{ij}).$$

This procedure can obviously be generalized to higher-order multiple integrals. 
More details can be found in Stroud (1971) and Davis and Rabinowitz 
(1975, Chapter 5). 
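The nested rule above can be sketched with the two-point Gauss-Legendre formula of Section 12.4 applied in each variable. The function names, the test region $D = \{0 \le x_1 \le 1,\ 0 \le x_2 \le x_1\}$, and the integrand $x_1 x_2$ are illustrative choices, not taken from the text; for this polynomial integrand the exact value is $1/8$ and the rule reproduces it.

```python
import math

# Two-point Gauss-Legendre rule on [-1, 1] (zeros of p_2, both weights equal to 1).
NODES = (-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0))
WEIGHTS = (1.0, 1.0)


def gauss2(g, a, b):
    """Two-point Gauss-Legendre approximation of the integral of g over [a, b]."""
    half, mid = 0.5 * (b - a), 0.5 * (b + a)
    return half * sum(w * g(half * z + mid) for z, w in zip(NODES, WEIGHTS))


def double_integral(f, a, b, lower, upper):
    """Nested rule: outer rule on [a, b], inner rule on [lower(x1), upper(x1)]."""
    def g(x1):
        # Inner integral (12.33) evaluated at the outer node x1.
        return gauss2(lambda x2: f(x1, x2), lower(x1), upper(x1))
    return gauss2(g, a, b)


value = double_integral(lambda x1, x2: x1 * x2, 0.0, 1.0,
                        lambda x1: 0.0, lambda x1: x1)
print(value)
```

Because the inner limits depend on $x_1$, the inner points $y_{ij}$ differ from one outer node to the next, which is why they carry two subscripts.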

The method of Laplace in Section 12.6 can be extended to an $n$-dimensional 
integral of the form 

$$I(\lambda) = \int_D \varphi(\mathbf{x}) e^{\lambda h(\mathbf{x})}\,d\mathbf{x},$$

which is a multidimensional version of the integral in (12.24). Here, $D$ is a 
region in $R^n$, which may be bounded or unbounded, $\lambda$ is a large positive 
constant, and $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$. As before, it is assumed that: 

a. $\varphi(\mathbf{x})$ is continuous in $D$. 

b. $h(\mathbf{x})$ has continuous first-order and second-order partial derivatives with 
respect to $x_1, x_2, \ldots, x_n$ in $D$. 

c. $h(\mathbf{x})$ has a single maximum in $D$ at $\mathbf{x} = \mathbf{t}$. 

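If $\mathbf{t}$ is an interior point of $D$, then it is also a stationary point of $h(\mathbf{x})$, that 
is, $\partial h/\partial x_i|_{\mathbf{x}=\mathbf{t}} = 0$, $i = 1, 2, \ldots, n$, since $\mathbf{t}$ is a point of maximum and the 
partial derivatives of $h(\mathbf{x})$ exist. Furthermore, the Hessian matrix, 

$$\mathbf{H}_h(\mathbf{t}) = \left(\frac{\partial^2 h(\mathbf{x})}{\partial x_i\,\partial x_j}\right)_{\mathbf{x}=\mathbf{t}},$$

is negative definite. Then, for large $\lambda$, $I(\lambda)$ is approximately equal to 

$$I(\lambda) \sim \left(\frac{2\pi}{\lambda}\right)^{n/2} \varphi(\mathbf{t}) \{\det[-\mathbf{H}_h(\mathbf{t})]\}^{-1/2} e^{\lambda h(\mathbf{t})}.$$

Here the determinant is taken of $-\mathbf{H}_h(\mathbf{t})$, which is positive definite, so that 
the square root is real for every $n$. A proof of this approximation can be found 
in Wong (1989, Section 9.5). We note that this expression is a generalization of 
Laplace’s formula (12.25) to an $n$-tuple Riemann integral. 

If $\mathbf{t}$ happens to be on the boundary of $D$ and still satisfies the conditions 
that $\partial h/\partial x_i|_{\mathbf{x}=\mathbf{t}} = 0$ for $i = 1, 2, \ldots, n$, and $\mathbf{H}_h(\mathbf{t})$ is negative definite, then it 
can be shown that for large $\lambda$, 

$$I(\lambda) \sim \frac{1}{2}\left(\frac{2\pi}{\lambda}\right)^{n/2} \varphi(\mathbf{t}) \{\det[-\mathbf{H}_h(\mathbf{t})]\}^{-1/2} e^{\lambda h(\mathbf{t})},$$

which is one-half of the previous approximation for $I(\lambda)$ (see Wong, 1989, 
page 498). 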


12.8. THE MONTE CARLO METHOD 


A new approach to approximate integration arose in the 1940s as part of the 
Monte Carlo method of S. Ulam and J. von Neumann (Haber, 1970). The 
basic idea of the Monte Carlo method for integrals is described as follows: 
suppose that we need to compute the integral 

$$I = \int_a^b f(x)\,dx. \tag{12.35}$$

We consider $I$ as the expected value of a certain stochastic process. An 
estimate of $I$ can be obtained by random sampling from this process, and the 
estimate is then used as an approximation to $I$. For example, let $X$ be a 
continuous random variable that has the uniform distribution $U(a, b)$ over 
the interval $[a, b]$. The expected value of $f(X)$ is 

$$E[f(X)] = \frac{1}{b - a} \int_a^b f(x)\,dx = \frac{I}{b - a}.$$

Let $x_1, x_2, \ldots, x_n$ be a random sample from $U(a, b)$. An estimate of $E[f(X)]$ 
is given by $(1/n)\sum_{i=1}^{n} f(x_i)$. Hence, an approximate value of $I$, denoted by $\hat{I}_n$, 
can be obtained as 

$$\hat{I}_n = \frac{b - a}{n} \sum_{i=1}^{n} f(x_i). \tag{12.36}$$

The justification for using $\hat{I}_n$ as an approximation to $I$ is that $\hat{I}_n$ is a 
consistent estimator of $I$, that is, for a given $\epsilon > 0$, 

$$\lim_{n \to \infty} P(|\hat{I}_n - I| \ge \epsilon) = 0.$$

This is true because $(1/n)\sum_{i=1}^{n} f(x_i)$ converges in probability to $E[f(X)]$ as 
$n \to \infty$, according to the law of large numbers (see Sections 3.7 and 5.6.3). In 
other words, the probability that $\hat{I}_n$ differs from $I$ by $\epsilon$ or more can be made 
arbitrarily close to zero if $n$ is chosen large enough. In fact, we even have the 
stronger result that $\hat{I}_n$ converges strongly, or almost surely, to $I$, as $n \to \infty$, by 
the strong law of large numbers (see Section 5.6.3). 

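The closeness of $\hat{I}_n$ to $I$ depends on the variance of $\hat{I}_n$, which is equal to 

$$\operatorname{Var}(\hat{I}_n) = (b - a)^2 \operatorname{Var}\left[\frac{1}{n}\sum_{i=1}^{n} f(x_i)\right] = \frac{(b - a)^2 \sigma_f^2}{n}, \tag{12.37}$$

where $\sigma_f^2$ is the variance of the random variable $f(X)$, that is, 

$$\begin{aligned}
\sigma_f^2 &= \operatorname{Var}[f(X)] \\
&= E[f^2(X)] - \{E[f(X)]\}^2 \\
&= \frac{1}{b - a}\int_a^b f^2(x)\,dx - \left(\frac{I}{b - a}\right)^2.
\end{aligned} \tag{12.38}$$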

By the central limit theorem (see Section 4.5.1), if $n$ is large enough, 
then $\hat{I}_n$ is approximately normally distributed with mean $I$ and variance 
$(1/n)(b - a)^2 \sigma_f^2$. Thus, 

$$\frac{\hat{I}_n - I}{(b - a)\sigma_f/\sqrt{n}} \xrightarrow{d} Z,$$

where $Z$ has the standard normal distribution, and the symbol $\xrightarrow{d}$ denotes 
convergence in distribution (see Section 4.5.1). It follows that for a given 
$\tau > 0$, 

$$P\left[|\hat{I}_n - I| \le \frac{\tau}{\sqrt{n}}(b - a)\sigma_f\right] \approx \frac{1}{\sqrt{2\pi}} \int_{-\tau}^{\tau} e^{-x^2/2}\,dx. \tag{12.39}$$

The right-hand side of (12.39) is the probability that a standard normal 
distribution attains values between $-\tau$ and $\tau$. Let us denote this probability 
by $1 - \alpha$. Then $\tau = z_{\alpha/2}$, which is the upper $(\alpha/2)100$th percentile of the 
standard normal distribution. If we denote the error of approximation, $\hat{I}_n - I$, 
by $E_n$, then formula (12.39) indicates that for large $n$, 

$$|E_n| \le \frac{1}{\sqrt{n}}(b - a)\sigma_f z_{\alpha/2} \tag{12.40}$$

with an approximate probability equal to $1 - \alpha$, which is called the confidence 
coefficient. Thus, for a fixed $\alpha$, the error bound in (12.40) is proportional 
to $\sigma_f$ and is inversely proportional to $\sqrt{n}$. For example, if $1 - \alpha = 0.90$, 
then $z_{\alpha/2} = 1.645$, and 

$$|E_n| \le \frac{1.645}{\sqrt{n}}(b - a)\sigma_f$$

with an approximate confidence coefficient equal to 0.90. Also, if $1 - \alpha = 0.95$, 
then $z_{\alpha/2} = 1.96$, and 

$$|E_n| \le \frac{1.96}{\sqrt{n}}(b - a)\sigma_f$$

with an approximate confidence coefficient equal to 0.95. 

In order to compute the error bound in (12.40), an estimate of $\sigma_f^2$ is 
needed. Using (12.38) and the random sample $x_1, x_2, \ldots, x_n$, an estimate of 
$\sigma_f^2$ is given by 

$$\hat{\sigma}_f^2 = \frac{1}{n}\sum_{i=1}^{n} f^2(x_i) - \left[\frac{1}{n}\sum_{i=1}^{n} f(x_i)\right]^2. \tag{12.41}$$
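The estimate (12.36), the variance estimate (12.41), and the error bound (12.40) fit together as in the following sketch. The function name `monte_carlo`, its parameters, the seed, and the test integrand $f(x) = x^2$ on $[1, 2]$ are illustrative assumptions; the random numbers come from Python's standard generator, so the printed values will differ from any tabulated in the text.

```python
import math
import random


def monte_carlo(f, a, b, n, z=1.96, rng=None):
    """Return the estimate (12.36) and the approximate error bound (12.40).

    z is the normal percentile z_{alpha/2}; 1.96 corresponds to 1 - alpha = 0.95.
    """
    rng = rng or random.Random()
    values = [f(rng.uniform(a, b)) for _ in range(n)]
    mean = sum(values) / n
    # Estimate (12.41) of the variance of f(X); clipped at 0 against roundoff.
    var_hat = max(sum(v * v for v in values) / n - mean * mean, 0.0)
    estimate = (b - a) * mean
    bound = z * (b - a) * math.sqrt(var_hat) / math.sqrt(n)
    return estimate, bound


estimate, bound = monte_carlo(lambda x: x * x, 1.0, 2.0, 10000, rng=random.Random(12345))
print(f"estimate = {estimate:.4f}, approximate 95% error bound = {bound:.4f}")
```

Quadrupling $n$ halves the bound, reflecting the $1/\sqrt{n}$ rate noted above.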


12.8.1. Variance Reduction 


In order to increase the accuracy of the Monte Carlo approximation of $I$, the 
error bound in (12.40) should be reduced for a fixed value of $\alpha$. We can 
achieve this by increasing $n$. Alternatively, we can reduce $\sigma_f$ by considering a 
distribution other than the uniform distribution. This can be accomplished by 
using the so-called method of importance sampling, a description of which 
follows. 

Let $g(x)$ be a density function that is positive over the interval $[a, b]$. 
Thus, $g(x) > 0$, $a \le x \le b$, and $\int_a^b g(x)\,dx = 1$. The integral in (12.35) can be 
written as 

$$I = \int_a^b \frac{f(x)}{g(x)}\, g(x)\,dx. \tag{12.42}$$

In this case, $I$ is the expected value of $f(X)/g(X)$, where $X$ is a continuous 
random variable with the density function $g(x)$. Using now a random sample, 
$x_1, x_2, \ldots, x_n$, from this distribution, an estimate of $I$ can be obtained as 

$$\hat{I}_n^* = \frac{1}{n}\sum_{i=1}^{n} \frac{f(x_i)}{g(x_i)}. \tag{12.43}$$

The variance of $\hat{I}_n^*$ is then given by 

$$\operatorname{Var}(\hat{I}_n^*) = \frac{\sigma_{fg}^2}{n},$$

where 

$$\begin{aligned}
\sigma_{fg}^2 &= \operatorname{Var}\left[\frac{f(X)}{g(X)}\right] \\
&= E\left\{\left[\frac{f(X)}{g(X)}\right]^2\right\} - \left\{E\left[\frac{f(X)}{g(X)}\right]\right\}^2 \\
&= \int_a^b \frac{f^2(x)}{g^2(x)}\, g(x)\,dx - I^2.
\end{aligned} \tag{12.44}$$

As before, the error bound can be derived on the basis of the central limit 
theorem, using the fact that $\hat{I}_n^*$ is approximately normally distributed with 
mean $I$ and variance $(1/n)\sigma_{fg}^2$ for large $n$. Hence, as in (12.40), 

$$|E_n^*| \le \frac{1}{\sqrt{n}}\,\sigma_{fg} z_{\alpha/2}$$

with an approximate probability equal to $1 - \alpha$, where $E_n^* = \hat{I}_n^* - I$. The 
density $g(x)$ should therefore be chosen so that an error bound smaller than 
the one in (12.40) can be achieved. For example, if $f(x) > 0$, and if 

$$g(x) = \frac{f(x)}{\int_a^b f(x)\,dx}, \tag{12.45}$$

then $\sigma_{fg}^2 = 0$, as can be seen from formula (12.44). 

Unfortunately, since the exact value of $\int_a^b f(x)\,dx$ is the one we seek to 
compute, formula (12.45) cannot be used. However, by choosing $g(x)$ to 
behave approximately as $f(x)$ [assuming $f(x)$ is positive], we should expect 
a reduction in the variance. Note that the generation of random variables 
from the $g(x)$ distribution is more involved than just using a random sample 
from the uniform distribution $U(a, b)$. 


EXAMPLE 12.8.1. Consider the integral $I = \int_1^2 x^2\,dx = \frac{7}{3} \approx 2.3333$. Suppose 
that we use a sample of 50 points from the uniform distribution $U(1, 2)$. 
Applying formula (12.36), we obtain $\hat{I}_{50} = 2.3669$. Repeating this process 
several times using higher values of $n$, we obtain the values in Table 12.2. 


Table 12.2. Approximate Values of $I = \int_1^2 x^2\,dx$ 
Using Formula (12.36) 

n      $\hat{I}_n$
50     2.3669
100    2.5087
150    2.2227
200    2.3067
250    2.3718
300    2.3115
350    2.3366


EXAMPLE 12.8.2. Let us now apply the method of importance sampling 
to the integral $I = \int_0^1 e^x\,dx \approx 1.7183$. Consider the density function $g(x) = 
\frac{2}{3}(1 + x)$ over the interval $[0, 1]$. Using the method described in Section 3.7, a 
random sample, $x_1, x_2, \ldots, x_n$, can be generated from this distribution as 
follows: the cumulative distribution function for $g(x)$ is $y = G(x) = P[X \le x]$, 
that is, 

$$\begin{aligned}
y = G(x) &= \int_0^x g(t)\,dt \\
&= \frac{2}{3}\int_0^x (1 + t)\,dt \\
&= \frac{2}{3}\left(x + \frac{x^2}{2}\right) \\
&= \frac{1}{3}(x^2 + 2x), \qquad 0 \le x \le 1.
\end{aligned}$$

The only solution of $y = G(x)$ in $[0, 1]$ is 

$$x = -1 + (1 + 3y)^{1/2}, \qquad 0 \le y \le 1.$$

Hence, the inverse function of $G(x)$ is 

$$G^{-1}(y) = -1 + (1 + 3y)^{1/2}, \qquad 0 \le y \le 1.$$

If $y_1, y_2, \ldots, y_n$ form a random sample of $n$ values from the uniform 
distribution $U(0, 1)$, then $x_1 = G^{-1}(y_1)$, $x_2 = G^{-1}(y_2)$, $\ldots$, $x_n = G^{-1}(y_n)$ will 
form a sample from the distribution with the density function $g(x)$. Formula 
(12.43) can then be applied to approximate the value of $I$ using the estimate 

$$\hat{I}_n^* = \frac{1}{n}\sum_{i=1}^{n} \frac{e^{x_i}}{\frac{2}{3}(1 + x_i)} = \frac{3}{2n}\sum_{i=1}^{n} \frac{e^{x_i}}{1 + x_i}. \tag{12.46}$$

Table 12.3 gives $\hat{I}_n^*$ for several values of $n$. For the sake of comparison, 
values of $\hat{I}_n$ from formula (12.36) [with $f(x) = e^x$, $a = 0$, $b = 1$] were also 
computed using a sample from the uniform distribution $U(0, 1)$. The results 
are shown in Table 12.3. We note that the values of $\hat{I}_n^*$ are more stable and 
closer to the true value of $I$ than those of $\hat{I}_n$. 


Table 12.3. Approximate Values of $I = \int_0^1 e^x\,dx$ 

n      $\hat{I}_n^*$ [Formula (12.46)]    $\hat{I}_n$ [Formula (12.36)]
50     1.7176                            1.7156
100    1.7137                            1.7025
150    1.7063                            1.6854
200    1.7297                            1.7516
250    1.7026                            1.6713
300    1.7189                            1.7201
350    1.7093                            1.6908
400    1.7188                            1.7192
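The sampling scheme of this example can be sketched as follows. The function names and the seed are hypothetical, and the random numbers are those of Python's standard generator, so the output will not reproduce the entries of Table 12.3; only the qualitative comparison is expected to carry over.

```python
import math
import random


def importance_estimate(n, rng):
    """Estimate (12.46): sample x = G^{-1}(y) with y ~ U(0, 1), then average f/g."""
    total = 0.0
    for _ in range(n):
        y = rng.random()
        x = -1.0 + math.sqrt(1.0 + 3.0 * y)   # inverse of G(x) = (x^2 + 2x)/3
        total += math.exp(x) / (1.0 + x)
    return 1.5 * total / n


def uniform_estimate(n, rng):
    """Estimate (12.36) with f(x) = exp(x), a = 0, b = 1."""
    return sum(math.exp(rng.random()) for _ in range(n)) / n


rng = random.Random(2024)
exact = math.e - 1.0
for n in (50, 200, 800, 3200):
    print(f"n = {n:5d}   importance: {importance_estimate(n, rng):.4f}"
          f"   uniform: {uniform_estimate(n, rng):.4f}   exact: {exact:.4f}")
```

Since $e^x/g(x)$ varies only between 1.5 and about 2.04 on $[0, 1]$, whereas $e^x$ itself varies between 1 and $e$, the importance-sampling estimate has the smaller variance.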


12.8.2. Integrals in Higher Dimensions 


The Monte Carlo method can be extended to multidimensional integrals. 
Consider computing the integral 7 = fp f(x) dx, where D is a bounded region 
in R”, the n-dimensional Euclidean space, and x = (x,,%5,...,x,,)’. As be- 
fore, we consider J as the expected value of a stochastic process having a 
certain distribution over D. For example, we may take X to be a continuous 
random vector uniformly distributed over D. By this we mean that the 
probability of X being in D is 1/v(D), where v(D) denotes the volume of D, 
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and the probability of X being outside D is zero. Hence, the expected value 
of f(X) is 


1 
v(D) 
I 
v( D) ` 


E[f(X)] J PQ) dx 


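The variance of $f(\mathbf{X})$, denoted by $\sigma_f^2$, is

$$\sigma_f^2 = E[f^2(\mathbf{X})] - \{E[f(\mathbf{X})]\}^2 = \frac{1}{v(D)}\int_D f^2(\mathbf{x})\,d\mathbf{x} - \left[\frac{I}{v(D)}\right]^2.$$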


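Let us now take a sample of $N$ independent observations on $\mathbf{X}$, namely,
$\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_N$. Then a consistent estimator of $E[f(\mathbf{X})]$ is $(1/N)\sum_{i=1}^{N} f(\mathbf{X}_i)$,
and hence, an estimate of $I$ is given by

$$\hat{I}_N = \frac{v(D)}{N}\sum_{i=1}^{N} f(\mathbf{X}_i),$$

which can be used to approximate $I$. The variance of $\hat{I}_N$ is

$$\mathrm{Var}(\hat{I}_N) = \frac{\sigma_f^2}{N}\,v^2(D). \tag{12.47}$$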


If $N$ is large enough, then by the central limit theorem, $\hat{I}_N$ is approximately
normally distributed with mean $I$ and variance as in (12.47). It follows that

$$|E_N| \le \frac{v(D)}{\sqrt{N}}\,\sigma_f\, z_{\alpha/2}$$

with an approximate probability equal to $1 - \alpha$, where $E_N = I - \hat{I}_N$ is the
error of approximation and $z_{\alpha/2}$ is the upper $(\alpha/2)100$th percentile of the
standard normal distribution. This formula is analogous to formula (12.40).

The method of importance sampling can also be applied here to reduce
the error of approximation. The application of this method is similar to the
case of a single-variable integral as seen earlier.
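As an illustration of the estimate $\hat{I}_N$ and of (12.47), the sketch below integrates a function over a rectangular box. The box, the integrand $f(x_1, x_2) = x_1 x_2$, the function name `mc_integral`, and the seed are all hypothetical choices made for this example, and $\sigma_f^2$ is replaced by its sample counterpart, which is an additional approximation.

```python
import math
import random

def mc_integral(f, lower, upper, n_samples, rng):
    """Monte Carlo estimate of the integral of f over a box, with its standard error."""
    volume = 1.0
    for lo, hi in zip(lower, upper):
        volume *= hi - lo                      # v(D) for a rectangular region
    total = 0.0
    total_sq = 0.0
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        fx = f(x)
        total += fx
        total_sq += fx * fx
    mean = total / n_samples
    var_f = max(total_sq / n_samples - mean * mean, 0.0)  # sample version of sigma_f^2
    estimate = volume * mean                               # I_hat_N
    std_error = volume * math.sqrt(var_f / n_samples)      # square root of (12.47)
    return estimate, std_error

rng = random.Random(2024)  # arbitrary seed
est, se = mc_integral(lambda x: x[0] * x[1], [0.0, 0.0], [2.0, 3.0], 20000, rng)
# The exact value of the integral over [0, 2] x [0, 3] is 9.
print(est, "+/-", 1.96 * se)
```

The half-width $1.96 \times$ (standard error) corresponds to the error bound above with $\alpha = 0.05$.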


12.9. APPLICATIONS IN STATISTICS 


Approximation of integrals is a problem of substantial concern for statisticians.
The statistical literature in this area has grown significantly in the last
20 years, particularly in connection with integrals that arise in Bayesian
statistics. Evans and Swartz (1995) presented a survey of the major techniques
and approaches available for the numerical approximation of integrals
in statistics. The proceedings edited by Flournoy and Tsutakawa (1991)
include several interesting articles on statistical multiple integration, including
a detailed description of available software to compute multidimensional
integrals (see the article by Kahaner, 1991, page 9).


12.9.1. The Gauss-Hermite Quadrature 


The Gauss-Hermite quadrature mentioned earlier in Section 12.5 is often
used for numerical integration in statistics because of its relation to Gaussian
(normal) densities. We recall that this quadrature is defined in terms of
integrals of the form $\int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx$. Using formula (12.20), we have approximately

$$\int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx \approx \sum_{i=0}^{n} \omega_i f(x_i),$$

where the $x_i$'s are the zeros of the Hermite polynomial $H_{n+1}(x)$ of degree
$n + 1$, and the $\omega_i$'s are suitably corresponding weights. Tables of $x_i$ and $\omega_i$
values are given by Abramowitz and Stegun (1972, page 924) and by Krylov
(1962, Appendix B).

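Liu and Pierce (1994) applied the Gauss-Hermite quadrature to integrals
of the form $\int_{-\infty}^{\infty} g(t)\,dt$, which can be expressed in the form

$$\int_{-\infty}^{\infty} g(t)\,dt = \int_{-\infty}^{\infty} f(t)\,\phi(t, \mu, \sigma)\,dt,$$

where $\phi(t, \mu, \sigma)$ is the normal density

$$\phi(t, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(t - \mu)^2\right],$$

and $f(t) = g(t)/\phi(t, \mu, \sigma)$. Thus,

$$\int_{-\infty}^{\infty} g(t)\,dt = \int_{-\infty}^{\infty} f(t)\,\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(t - \mu)^2\right] dt$$

$$= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} f(\mu + \sqrt{2}\,\sigma x)\, e^{-x^2}\,dx$$

$$\approx \frac{1}{\sqrt{\pi}} \sum_{i=0}^{n} \omega_i\, f(\mu + \sqrt{2}\,\sigma x_i), \tag{12.48}$$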
where the $x_i$'s are the zeros of the Hermite polynomial $H_{n+1}(x)$ of degree
$n + 1$. We may recall that the $x_i$'s and $\omega_i$'s are chosen so that this approximation
will be exact for all polynomials of degrees not exceeding $2n + 1$. For
this reason, Liu and Pierce (1994) recommend choosing $\mu$ and $\sigma$ in (12.48)
so that $f(t)$ is well approximated by a low-order polynomial in the region
where the values of $\mu + \sqrt{2}\,\sigma x_i$ are taken. More specifically, the $(n + 1)$th-order
Gauss-Hermite quadrature in (12.48) will be highly effective if the
ratio of $g(t)$ to the normal density $\phi(t, \mu, \sigma)$ can be well approximated by a
polynomial of degree not exceeding $2n + 1$ in the region where $g(t)$ is
substantial. This arises frequently, for example, when $g(t)$ is a likelihood
function [if $g(t) > 0$], or the product of a likelihood function and a normal
density, as was pointed out by Liu and Pierce (1994), who gave several
examples to demonstrate the usefulness of the approximation in (12.48).
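A small numerical sketch of (12.48) may help fix ideas. It uses the three-point rule ($n = 2$), whose nodes $0, \pm\sqrt{3/2}$ and weights $2\sqrt{\pi}/3, \sqrt{\pi}/6$ are standard; the function names and the test integrand $g(t) = t^2\phi(t, \mu, \sigma)$ are illustrative assumptions, not taken from Liu and Pierce (1994).

```python
import math

# Three-point Gauss-Hermite rule: zeros of H_3 and the corresponding weights.
NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
WEIGHTS = (math.sqrt(math.pi) / 6.0, 2.0 * math.sqrt(math.pi) / 3.0, math.sqrt(math.pi) / 6.0)

def normal_pdf(t, mu, sigma):
    """Normal density phi(t, mu, sigma) with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / math.sqrt(2.0 * math.pi * sigma ** 2)

def gauss_hermite_integral(g, mu, sigma):
    """Approximate the integral of g over the real line as in (12.48)."""
    total = 0.0
    for x, w in zip(NODES, WEIGHTS):
        t = mu + math.sqrt(2.0) * sigma * x
        total += w * g(t) / normal_pdf(t, mu, sigma)  # f(t) = g(t) / phi(t, mu, sigma)
    return total / math.sqrt(math.pi)

mu, sigma = 1.0, 2.0
g = lambda t: t * t * normal_pdf(t, mu, sigma)
# Here f(t) = t^2 is a polynomial of degree 2 <= 2n + 1 = 5, so the rule is exact
# up to rounding: the integral equals mu^2 + sigma^2.
print(gauss_hermite_integral(g, mu, sigma))
```

When $f$ is not a polynomial, the quality of the answer depends on how well $\mu$ and $\sigma$ are matched to $g$, which is the point of the recommendation above.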


12.9.2. Minimum Mean Squared Error Quadrature 


Correlated observations may arise in some experimental work (Piegorsch and
Bailer, 1993). Consider, for example, the model

$$y_{qr} = f(t_q) + \epsilon_{qr},$$

where $y_{qr}$ represents the observed value from experimental unit $r$ at time $t_q$
($q = 0, 1, \ldots, m$; $r = 1, 2, \ldots, n$), $f(t)$ is the underlying response function, and
$\epsilon_{qr}$ is a random error term. It is assumed that $E(\epsilon_{qr}) = 0$, $\mathrm{Cov}(\epsilon_{pr}, \epsilon_{qr}) = \sigma_{pq}$,
and $\mathrm{Cov}(\epsilon_{pr}, \epsilon_{qs}) = 0$ ($r \ne s$) for all $p, q$. The area under the response curve
over the interval $t_0 < t < t_m$ is

$$A = \int_{t_0}^{t_m} f(t)\,dt. \tag{12.49}$$

This is an important measure for a variety of experimental situations,
including the assessment of chemical bioavailability in drug disposition studies
(Gibaldi and Perrier, 1982, Chapters 2, 7) and other clinical settings. If the
functional form of $f(\cdot)$ is unknown, the integral in (12.49) is estimated by
numerical methods. This is accomplished using a quadrature approximation
of the integral in (12.49). By definition, a quadrature estimator of this
integral is

$$\hat{A} = \sum_{q=0}^{m} \phi_q \hat{f}_q, \tag{12.50}$$

where $\hat{f}_q$ is some unbiased estimator of $f_q = f(t_q)$ with $\mathrm{Cov}(\hat{f}_p, \hat{f}_q) = \sigma_{pq}/n$,
and the $\phi_q$'s form a set of quadrature coefficients.
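The expected value of $\hat{A}$,

$$E(\hat{A}) = \sum_{q=0}^{m} \phi_q f_q,$$

may not necessarily be equal to $A$ due to the quadrature approximation
employed in calculating $\hat{A}$. The bias in estimating $A$ is

$$\text{bias} = E(\hat{A}) - A,$$

and the mean squared error (MSE) of $\hat{A}$ is given by

$$\mathrm{MSE}(\hat{A}) = \mathrm{Var}(\hat{A}) + [E(\hat{A}) - A]^2 = \frac{1}{n}\sum_{p=0}^{m}\sum_{q=0}^{m} \sigma_{pq}\,\phi_p \phi_q + \left(\sum_{q=0}^{m} \phi_q f_q - A\right)^2,$$

which can be written as

$$\mathrm{MSE}(\hat{A}) = \frac{1}{n}\,\boldsymbol{\phi}'\mathbf{V}\boldsymbol{\phi} + (\mathbf{f}'\boldsymbol{\phi} - A)^2,$$

where $\mathbf{V} = (\sigma_{pq})$, $\mathbf{f} = (f_0, f_1, \ldots, f_m)'$, and $\boldsymbol{\phi} = (\phi_0, \phi_1, \ldots, \phi_m)'$. Hence,

$$\mathrm{MSE}(\hat{A}) = \boldsymbol{\phi}'\left[\frac{1}{n}\mathbf{V} + \mathbf{f}\mathbf{f}'\right]\boldsymbol{\phi} - 2A\,\mathbf{f}'\boldsymbol{\phi} + A^2. \tag{12.51}$$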


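Let us now seek an optimum value of $\boldsymbol{\phi}$ that minimizes $\mathrm{MSE}(\hat{A})$ in (12.51).
For this purpose, we equate the gradient of $\mathrm{MSE}(\hat{A})$ with respect to $\boldsymbol{\phi}$,
namely $\nabla_{\boldsymbol{\phi}}\mathrm{MSE}(\hat{A})$, to zero. We obtain

$$\nabla_{\boldsymbol{\phi}}\mathrm{MSE}(\hat{A}) = 2\left\{\left[\frac{1}{n}\mathbf{V} + \mathbf{f}\mathbf{f}'\right]\boldsymbol{\phi} - A\mathbf{f}\right\} = \mathbf{0}. \tag{12.52}$$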


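In order for this equation to have a solution, the vector $A\mathbf{f}$ must belong to
the column space of the matrix $(1/n)\mathbf{V} + \mathbf{f}\mathbf{f}'$. Note that this matrix is positive
semidefinite. If its inverse exists, then it will be positive definite, and the only
solution, $\boldsymbol{\phi}^*$, to (12.52), namely,

$$\boldsymbol{\phi}^* = A\left[\frac{1}{n}\mathbf{V} + \mathbf{f}\mathbf{f}'\right]^{-1}\mathbf{f}, \tag{12.53}$$

yields a unique minimum of $\mathrm{MSE}(\hat{A})$ (see Section 7.7). Using $\boldsymbol{\phi}^*$ in (12.50),
we obtain the following estimate of $A$:

$$\hat{A}^* = A\,\hat{\mathbf{f}}'\left[\frac{1}{n}\mathbf{V} + \mathbf{f}\mathbf{f}'\right]^{-1}\mathbf{f}. \tag{12.54}$$

Using the Sherman-Morrison formula (see Exercise 2.14), (12.54) can be
written as

$$\hat{A}^* = nA\left[\hat{\mathbf{f}}'\mathbf{V}^{-1}\mathbf{f} - \frac{n\,\hat{\mathbf{f}}'\mathbf{V}^{-1}\mathbf{f}\,\mathbf{f}'\mathbf{V}^{-1}\mathbf{f}}{1 + n\,\mathbf{f}'\mathbf{V}^{-1}\mathbf{f}}\right], \tag{12.55}$$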


where $\hat{\mathbf{f}} = (\hat{f}_0, \hat{f}_1, \ldots, \hat{f}_m)'$. We refer to $\hat{A}^*$ as an MSE-optimal estimator of
$A$. Replacing $\mathbf{f}$ with its unbiased estimator $\hat{\mathbf{f}}$, and $A$ with some initial
estimate, say $\hat{A}_0 = \hat{\mathbf{f}}'\boldsymbol{\phi}_0$, where $\boldsymbol{\phi}_0$ is an initial value of $\boldsymbol{\phi}$, yields the
approximate MSE-optimal estimator

$$\hat{A}^{**} = \frac{c}{1 + c}\,\hat{A}_0, \tag{12.56}$$

where $c$ is the quadratic form

$$c = n\,\hat{\mathbf{f}}'\mathbf{V}^{-1}\hat{\mathbf{f}}.$$

This estimator has the form $\hat{\mathbf{f}}'\tilde{\boldsymbol{\phi}}$, where $\tilde{\boldsymbol{\phi}} = [c/(1 + c)]\boldsymbol{\phi}_0$. Since $c \ge 0$, $\hat{A}^{**}$
provides a shrinkage of the initial estimate $\hat{A}_0$ toward zero. This creates a
biased estimator of $A$ with a smaller variance, which results in an overall
reduction in MSE. A similar approach is used in ridge regression to estimate
the parameters of a linear model when the columns of the corresponding
matrix $\mathbf{X}$ are multicollinear (see Section 5.6.6).

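We note that this procedure requires knowledge of $\mathbf{V}$. In many applications,
it may not be possible to specify $\mathbf{V}$. Piegorsch and Bailer (1993) used an
estimate of $\mathbf{V}$ when $\mathbf{V}$ was assumed to have certain particular patterns, such
as $\mathbf{V} = \sigma^2\mathbf{I}$ or $\mathbf{V} = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}]$, where $\mathbf{I}$ and $\mathbf{J}$ are, respectively, the
identity matrix and the square matrix of ones, and $\sigma^2$ and $\rho$ are unknown
constants. For example, under the equal variance assumption $\mathbf{V} = \sigma^2\mathbf{I}$, the
ratio $c/(1 + c)$ is equal to

$$\frac{c}{1 + c} = \left(1 + \frac{\sigma^2}{n\,\hat{\mathbf{f}}'\hat{\mathbf{f}}}\right)^{-1}.$$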


If $\hat{\mathbf{f}}$ is normally distributed, $\hat{\mathbf{f}} \sim N[\mathbf{f}, (1/n)\mathbf{V}]$, as is the case when $\hat{\mathbf{f}}$ is a vector
of means, $\hat{\mathbf{f}} = (\bar{y}_0, \bar{y}_1, \ldots, \bar{y}_m)'$ with $\bar{y}_q = (1/n)\sum_{r=1}^{n} y_{qr}$, then an unbiased
estimate of $\sigma^2$ is given by

$$s^2 = \frac{1}{(m + 1)(n - 1)} \sum_{q=0}^{m}\sum_{r=1}^{n} (y_{qr} - \bar{y}_q)^2.$$

Substituting $s^2$ in place of $\sigma^2$ in (12.56), we obtain the area estimator

$$\hat{A}^{**} = \hat{A}_0\left(1 + \frac{s^2}{n\,\hat{\mathbf{f}}'\hat{\mathbf{f}}}\right)^{-1}.$$

A similar quadrature approximation of $A$ in (12.49) was considered earlier
by Katz and D'Argenio (1983) using a trapezoidal approximation to $A$. The
quadrature points were selected so that they minimize the expectation of the
square of the difference between the exact integral and the quadrature
approximation. This approach was applied to simulated pharmacokinetic
problems.
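The equal-variance version of the shrinkage estimator can be sketched in a few lines. The example below assumes $\mathbf{V} = \sigma^2\mathbf{I}$, takes $\hat{\mathbf{f}}$ to be the vector of time-point means, and uses trapezoidal coefficients for the initial value $\boldsymbol{\phi}_0$; the trapezoidal choice, the function name, and the toy data are assumptions made for illustration only.

```python
def shrinkage_area(times, y):
    """Return (A0_hat, A_hat**) where y[q][r] is the response of unit r at times[q]."""
    m_plus_1 = len(times)
    n = len(y[0])
    means = [sum(row) / n for row in y]          # f_hat: mean response at each time
    # Trapezoidal quadrature coefficients phi_0 (assumed initial choice).
    phi0 = [0.0] * m_plus_1
    for q in range(m_plus_1 - 1):
        h = times[q + 1] - times[q]
        phi0[q] += h / 2.0
        phi0[q + 1] += h / 2.0
    a0 = sum(p * f for p, f in zip(phi0, means))  # initial estimate f_hat' phi_0
    # Pooled variance estimate s^2 with (m + 1)(n - 1) degrees of freedom.
    ss = sum((obs - means[q]) ** 2 for q, row in enumerate(y) for obs in row)
    s2 = ss / (m_plus_1 * (n - 1))
    ff = sum(f * f for f in means)                # f_hat' f_hat
    shrink = 1.0 / (1.0 + s2 / (n * ff))          # c / (1 + c) with sigma^2 replaced by s^2
    return a0, shrink * a0

# Toy data: three time points, two experimental units.
a0, a_shrunk = shrinkage_area([0.0, 1.0, 2.0], [[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]])
print(a0, a_shrunk)
```

The shrinkage factor approaches one as $n\,\hat{\mathbf{f}}'\hat{\mathbf{f}}$ grows relative to $s^2$, so the adjustment matters mainly for small or noisy experiments.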


12.9.3. Moments of a Ratio of Quadratic Forms 


Consider the ratio of quadratic forms,

$$Q = \frac{\mathbf{y}'\mathbf{A}\mathbf{y}}{\mathbf{y}'\mathbf{B}\mathbf{y}}, \tag{12.57}$$

where $\mathbf{A}$ and $\mathbf{B}$ are symmetric matrices, $\mathbf{B}$ is positive definite, and $\mathbf{y}$ is an
$n \times 1$ random vector. Ratios such as $Q$ are frequently encountered in
statistics and econometrics. In general, their exact distributions are mathematically
intractable, especially when the quadratic forms are not independent.
For this reason, the derivation of the moments of such ratios, for the
purpose of approximating their distributions, is of interest. Sutradhar and
Bartlett (1989) obtained approximate expressions for the first four moments
of the ratio $Q$ for a normally distributed $\mathbf{y}$. The moments were utilized to
approximate the distribution of $Q$. This approximation was then applied to
calculate the percentile points of a modified F-test statistic for testing
treatment effects in a one-way model under correlated observations.
Morin (1992) derived exact, but complicated, expressions for the first four
moments of $Q$ for a normally distributed $\mathbf{y}$. The moments are expressed in
terms of confluent hypergeometric functions of many variables.

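If $\mathbf{y}$ is not normally distributed, then no tractable formulas exist for the
moments of $Q$ in (12.57). Hence, manageable and computable approximations
for these moments would be helpful. Lieberman (1994) used the
method of Laplace to provide general approximations for the moments of $Q$
without making the normality assumption on $\mathbf{y}$. Lieberman showed that if
$E(\mathbf{y}'\mathbf{B}\mathbf{y})$ and $E[(\mathbf{y}'\mathbf{A}\mathbf{y})^k]$ exist for $k \ge 1$, then the Laplace approximation of
$E(Q^k)$, the kth noncentral moment of $Q$, is given by

$$E_L(Q^k) = \frac{E[(\mathbf{y}'\mathbf{A}\mathbf{y})^k]}{[E(\mathbf{y}'\mathbf{B}\mathbf{y})]^k}. \tag{12.58}$$

In particular, if $\mathbf{y} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then the Laplace approximations for the mean
and second noncentral moment of $Q$ are written explicitly as

$$E_L(Q) = \frac{\mathrm{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}}{\mathrm{tr}(\mathbf{B}\boldsymbol{\Sigma}) + \boldsymbol{\mu}'\mathbf{B}\boldsymbol{\mu}},$$

$$E_L(Q^2) = \frac{E[(\mathbf{y}'\mathbf{A}\mathbf{y})^2]}{[E(\mathbf{y}'\mathbf{B}\mathbf{y})]^2} = \frac{\mathrm{Var}(\mathbf{y}'\mathbf{A}\mathbf{y}) + [E(\mathbf{y}'\mathbf{A}\mathbf{y})]^2}{[E(\mathbf{y}'\mathbf{B}\mathbf{y})]^2}$$

$$= \frac{2\,\mathrm{tr}[(\mathbf{A}\boldsymbol{\Sigma})^2] + 4\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu} + [\mathrm{tr}(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}]^2}{[\mathrm{tr}(\mathbf{B}\boldsymbol{\Sigma}) + \boldsymbol{\mu}'\mathbf{B}\boldsymbol{\mu}]^2}$$

(see Searle, 1971, Section 2.5 for expressions for the mean and variance of a
quadratic form in normal variables).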
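The two explicit formulas for normal $\mathbf{y}$ are easy to evaluate numerically. The sketch below does so with plain Python lists; the helper names and the small $2 \times 2$ example are hypothetical, chosen only so that the arithmetic can be followed by hand.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def quad_form(u, a, v):
    """u' A v for vectors u, v and matrix A."""
    return sum(u[i] * a[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))

def laplace_moments(A, B, mu, Sigma):
    """Laplace approximations E_L(Q) and E_L(Q^2) for y ~ N(mu, Sigma)."""
    AS = matmul(A, Sigma)
    BS = matmul(B, Sigma)
    mean_num = trace(AS) + quad_form(mu, A, mu)   # E(y'Ay)
    mean_den = trace(BS) + quad_form(mu, B, mu)   # E(y'By)
    # Var(y'Ay) = 2 tr[(A Sigma)^2] + 4 mu' A Sigma A mu
    var_num = 2.0 * trace(matmul(AS, AS)) + 4.0 * quad_form(mu, matmul(AS, A), mu)
    return mean_num / mean_den, (var_num + mean_num ** 2) / mean_den ** 2

identity = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0], [0.0, 0.0]]
m1, m2 = laplace_moments(A, identity, [0.0, 0.0], identity)
print(m1, m2)
```

With these inputs $Q = y_1^2/(y_1^2 + y_2^2)$, whose exact mean is $1/2$, so the first approximation is exact here; the second moment is only approximate, as expected for such a small $n$.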


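EXAMPLE 12.9.1. Consider the linear model, $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$, where $\mathbf{X}$ is
$n \times p$ of rank $p$, $\boldsymbol{\beta}$ is a vector of unknown parameters, and $\boldsymbol{\epsilon}$ is a random
error vector. Under the null hypothesis of no serial correlation in the
elements of $\boldsymbol{\epsilon}$, we have $\boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$. The corresponding Durbin-Watson
(1950, 1951) test statistic is given by

$$d = \frac{\boldsymbol{\epsilon}'\mathbf{P}\mathbf{A}_1\mathbf{P}\boldsymbol{\epsilon}}{\boldsymbol{\epsilon}'\mathbf{P}\boldsymbol{\epsilon}},$$

where $\mathbf{P} = \mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, and $\mathbf{A}_1$ is the matrix

$$\mathbf{A}_1 = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0 & 0 \\
-1 & 2 & -1 & \cdots & 0 & 0 \\
0 & -1 & 2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 2 & -1 \\
0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix}.$$

Then, by (12.58), the Laplace approximation of $E(d)$ is

$$E_L(d) = \frac{E(\boldsymbol{\epsilon}'\mathbf{P}\mathbf{A}_1\mathbf{P}\boldsymbol{\epsilon})}{E(\boldsymbol{\epsilon}'\mathbf{P}\boldsymbol{\epsilon})}. \tag{12.59}$$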

Durbin and Watson (1951) showed that $d$ is distributed independently of its
own denominator, so that the moments of the ratio $d$ are the ratios of the
corresponding moments of the numerator and denominator, that is,

$$E(d^k) = \frac{E[(\boldsymbol{\epsilon}'\mathbf{P}\mathbf{A}_1\mathbf{P}\boldsymbol{\epsilon})^k]}{E[(\boldsymbol{\epsilon}'\mathbf{P}\boldsymbol{\epsilon})^k]}. \tag{12.60}$$

From (12.59) and (12.60) we note that the Laplace approximation for the
mean, $E(d)$, is exact. For $k \ge 2$, Lieberman (1994) showed that

$$\frac{E_L(d^k)}{E(d^k)} = 1 + O\left(\frac{1}{n}\right).$$

Thus, the relative error of approximating higher-order moments of $d$ is
$O(1/n)$, regardless of the matrix $\mathbf{X}$.
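To make the role of $\mathbf{A}_1$ concrete, the sketch below evaluates the Durbin-Watson ratio in two ways for a given residual vector $\mathbf{e} = \mathbf{P}\boldsymbol{\epsilon}$: as the quadratic form $\mathbf{e}'\mathbf{A}_1\mathbf{e}/\mathbf{e}'\mathbf{e}$, and as the familiar sum of squared successive differences divided by the sum of squares. The function names and the three residuals are hypothetical.

```python
def dw_matrix(n):
    """Tridiagonal matrix A_1: diagonal 1, 2, ..., 2, 1 and off-diagonals -1."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = 1.0 if i in (0, n - 1) else 2.0
        if i + 1 < n:
            a[i][i + 1] = -1.0
            a[i + 1][i] = -1.0
    return a

def dw_from_quadratic_form(e):
    """d = e' A_1 e / e' e for a residual vector e."""
    a = dw_matrix(len(e))
    num = sum(e[i] * a[i][j] * e[j] for i in range(len(e)) for j in range(len(e)))
    return num / sum(x * x for x in e)

def dw_from_differences(e):
    """d as the sum of squared successive differences over the sum of squares."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(x * x for x in e)

residuals = [1.0, -1.0, 2.0]
print(dw_from_quadratic_form(residuals), dw_from_differences(residuals))
```

The two functions agree because $\mathbf{e}'\mathbf{A}_1\mathbf{e} = \sum_{t=2}^{n}(e_t - e_{t-1})^2$ for this choice of $\mathbf{A}_1$.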


12.9.4. Laplace’s Approximation in Bayesian Statistics 


Suppose (Kass, Tierney, and Kadane, 1991) that a data vector $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$
has a distribution with the density function $p(\mathbf{y}|\theta)$, where $\theta$
is an unknown parameter. Let $L(\theta)$ denote the corresponding likelihood
function, which is proportional to $p(\mathbf{y}|\theta)$. In Bayesian statistics, a prior
density, $\pi(\theta)$, is assumed on $\theta$, and inferences are based on the posterior
density $q(\theta|\mathbf{y})$, which is proportional to $L(\theta)\pi(\theta)$, where the proportionality
constant is determined by requiring that $q(\theta|\mathbf{y})$ integrate to one. For a given
real-valued function $g(\theta)$, its posterior expectation is given by

$$E[g(\theta)|\mathbf{y}] = \frac{\int g(\theta) L(\theta)\pi(\theta)\,d\theta}{\int L(\theta)\pi(\theta)\,d\theta}. \tag{12.61}$$


Tierney, Kass, and Kadane (1989) expressed the integrands in (12.61) as
follows:

$$g(\theta)L(\theta)\pi(\theta) = b_N(\theta)\exp[-n h_N(\theta)],$$

$$L(\theta)\pi(\theta) = b_D(\theta)\exp[-n h_D(\theta)],$$

where $b_N(\theta)$ and $b_D(\theta)$ are smooth functions that do not depend on $n$ and
$h_N(\theta)$ and $h_D(\theta)$ are constant-order functions of $n$, as $n \to \infty$. Formula
(12.61) can then be written as

$$E[g(\theta)|\mathbf{y}] = \frac{\int b_N(\theta)\exp[-n h_N(\theta)]\,d\theta}{\int b_D(\theta)\exp[-n h_D(\theta)]\,d\theta}. \tag{12.62}$$


Applying Laplace's approximation in (12.25) to the integrals in (12.62), we
obtain, if $n$ is large,

$$E[g(\theta)|\mathbf{y}] \approx \left[\frac{h_D''(\hat{\theta}_D)}{h_N''(\hat{\theta}_N)}\right]^{1/2} \frac{b_N(\hat{\theta}_N)\exp[-n h_N(\hat{\theta}_N)]}{b_D(\hat{\theta}_D)\exp[-n h_D(\hat{\theta}_D)]}, \tag{12.63}$$

where $\hat{\theta}_N$ and $\hat{\theta}_D$ are the locations of the single maxima of $-h_N(\theta)$ and
$-h_D(\theta)$, respectively. In particular, if we choose $h_N(\theta) = h_D(\theta) = -(1/n)\log[L(\theta)\pi(\theta)]$,
$b_N(\theta) = g(\theta)$ and $b_D(\theta) = 1$, then (12.63) reduces to

$$E[g(\theta)|\mathbf{y}] \approx g(\hat{\theta}), \tag{12.64}$$

where $\hat{\theta}$ is the point at which $(1/n)\log[L(\theta)\pi(\theta)]$ attains its maximum.

Formula (12.64) provides a first-order approximation of $E[g(\theta)|\mathbf{y}]$. This
approximation is often called the modal approximation because $\hat{\theta}$ is the mode
of the posterior density. A more accurate second-order approximation of
$E[g(\theta)|\mathbf{y}]$ was given by Kass, Tierney, and Kadane (1991).
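A simple conjugate example lets the two uses of (12.63) be compared with an exact answer. The sketch below assumes Bernoulli data with $s$ successes in $n$ trials, a uniform prior on $\theta$, and $g(\theta) = \theta$, so that the posterior is Beta$(s+1, n-s+1)$ with mean $(s+1)/(n+2)$. It evaluates the modal approximation (12.64) and also (12.63) with the alternative choice $b_N = b_D = 1$, $h_N = -(1/n)\log[g(\theta)L(\theta)\pi(\theta)]$, which is valid here because $g > 0$. The model, the function name, and this alternative choice are assumptions made for illustration.

```python
import math

def posterior_mean_approximations(s, n):
    """Exact posterior mean of theta and two Laplace-type approximations (0 < s < n)."""
    exact = (s + 1.0) / (n + 2.0)     # mean of the Beta(s + 1, n - s + 1) posterior
    theta_d = s / n                   # maximizer of L(theta) pi(theta)
    modal = theta_d                   # formula (12.64): g evaluated at the posterior mode
    theta_n = (s + 1.0) / (n + 1.0)   # maximizer of theta * L(theta) pi(theta)
    # n * h''(theta) at each maximizer (curvature of minus the log integrand).
    curv_d = s / theta_d ** 2 + (n - s) / (1.0 - theta_d) ** 2
    curv_n = (s + 1.0) / theta_n ** 2 + (n - s) / (1.0 - theta_n) ** 2
    # log of exp[-n h_N(theta_N)] / exp[-n h_D(theta_D)]
    log_ratio = ((s + 1.0) * math.log(theta_n) + (n - s) * math.log(1.0 - theta_n)
                 - s * math.log(theta_d) - (n - s) * math.log(1.0 - theta_d))
    ratio_form = math.sqrt(curv_d / curv_n) * math.exp(log_ratio)  # (12.63) with b_N = b_D = 1
    return exact, modal, ratio_form

print(posterior_mean_approximations(6, 20))
```

In this example the ratio form of (12.63) lands much closer to the exact posterior mean than the modal value does, which is in line with the remark that more accurate approximations than (12.64) are available.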


12.9.5. Other Methods of Approximating Integrals in Statistics 


There are several major techniques and approaches available for the numerical
approximation of integrals in statistics that are beyond the scope of this
book. These techniques, which include the saddlepoint approximation and
Markov chain Monte Carlo, have received a great deal of attention in the
statistical literature in recent years.

The saddlepoint method is designed to approximate integrals of the 
Laplace type in which both the integrand and contour of integration are 
allowed to be complex valued. It is a powerful tool for obtaining accurate 
expressions for densities and distribution functions. A good introduction to 
the basic principles underlying this method was given by De Bruijn (1961, 
Chapter 5). Daniels (1954) is credited with having introduced it in statistics in 
the context of approximating the density of a sample mean of independent 
and identically distributed random variables. 

Markov chain Monte Carlo (MCMC) is a general method for the simulation
of stochastic processes having probability densities known up to a
constant of proportionality. It generally deals with high-dimensional statistical
problems, and has come into prominence in statistical applications during
the past several years. Although MCMC has potential applications in several
areas of statistics, most attention to date has been focused on Bayesian
applications.

For a review of these techniques, see, for example, Geyer (1992), Evans 
and Swartz (1995), Goutis and Casella (1999), and Strawderman (2000). 
EXERCISES

In Mathematics


12.1. Consider the integral

$$I_n = \int_1^n \log x\,dx.$$

It is easy to show that

$$I_n = n\log n - n + 1.$$

(a) Approximate $I_n$ by using the trapezoidal method and the partition
points $x_1 = 1, x_2 = 2, \ldots, x_n = n$, and verify that

$$I_n \approx \log(n!) - \tfrac{1}{2}\log n.$$

(b) Deduce from (a) that $n!$ and $n^{n+1/2}e^{-n}$ are of the same order of
magnitude, which is essentially what is stated in Stirling's formula
(see Example 12.6.1).


12.2. Obtain an approximation of the integral $\int_0^1 dx/(1 + x)$ by Simpson's
method for the following values of $n$: 2, 4, 8, 16. Show that when $n = 8$,
the error of approximation is less than or equal to 0.000002.
12.3. Use Gauss-Legendre quadrature with $n = 2$ to approximate the value
of the integral $\int_0^{\pi/2} \sin\theta\,d\theta$. Give an upper bound on the error of
approximation using formula (12.16).


12.4. (a) Show that

$$\int_m^{\infty} e^{-x^2}\,dx < \frac{1}{m}e^{-m^2}, \qquad m > 0.$$

[Hint: Use the inequality $e^{-x^2} < e^{-mx}$ for $x > m$.]

(b) Find a value of $m$ so that the upper bound in (a) is smaller than
$10^{-4}$.

(c) Use part (b) to find an approximation for $\int_0^{\infty} e^{-x^2}\,dx$ correct to
three decimal places.


12.5. Obtain an approximate value for the integral $\int_0^{\infty} x(e^x + e^{-x} - 1)^{-1}\,dx$
using the Gauss-Laguerre quadrature. [Hint: Use the tables in Appendix C
of the book by Krylov (1962, page 347) giving the zeros of
Laguerre polynomials and the corresponding values of $\omega_i$.]


12.6. Consider the indefinite integral

$$I(x) = \int_0^x \frac{dt}{1 + t^2} = \operatorname{Arctan} x.$$

(a) Make an appropriate change of variables to show that $I(x)$ can be
written as

$$I(x) = 2x\int_{-1}^{1} \frac{du}{4 + x^2(u + 1)^2}.$$

(b) Use a five-point Gauss-Legendre quadrature to provide an approximation
for $I(x)$.


12.7. Investigate the asymptotic behavior of the integral

$$I(\lambda) = \int_0^1 (\cos x)^{\lambda}\,dx$$

as $\lambda \to \infty$.


12.8. (a) Use the Gauss-Laguerre quadrature to approximate the integral
$\int_0^{\infty} e^{-6x}\sin x\,dx$ using $n = 1, 2, 3, 4$, and compare the results with
the true value of the integral.

(b) Use the Gauss-Hermite quadrature to approximate the integral
$\int_{-\infty}^{\infty} |x|\exp(-3x^2)\,dx$ using $n = 1, 2, 3, 4$, and compare the results
with the true value of the integral.


12.9. Give an approximation to the double integral

$$\int_0^1 \int_0^{(1 - x_2^2)^{1/2}} (1 - x_1^2 - x_2^2)^{1/2}\,dx_1\,dx_2$$

by applying the Gauss-Legendre rule to formulas (12.32) and (12.33).


12.10. Show that [$ +x°)™” dx is asymptotically equal to (r/n) as 


n > o, 


In Statistics 


12.11. Consider a sample, $X_1, X_2, \ldots, X_n$, of independent and identically
distributed random variables from the standard normal distribution.
Suppose that $n$ is odd. The sample median is the $(m + 1)$th order
statistic $X_{(m+1)}$, where $m = (n - 1)/2$. It is known that $X_{(m+1)}$ has
the density function

$$f(x) = n\binom{n-1}{m}[\Phi(x)]^m [1 - \Phi(x)]^m \phi(x),$$

where

$$\phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$$

is the standard normal density function, and

$$\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt$$

(see Roussas, 1973, page 194). Since the mean of $X_{(m+1)}$ is zero, the
variance of $X_{(m+1)}$ is given by

$$\mathrm{Var}[X_{(m+1)}] = n\binom{n-1}{m}\int_{-\infty}^{\infty} x^2 [\Phi(x)]^m [1 - \Phi(x)]^m \phi(x)\,dx.$$

Obtain an approximation for this variance using the Gauss-Hermite
quadrature for $n = 11$ and varying numbers of quadrature points.
12.12. (Morland, 1998.) Consider the density function

$$\phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}$$

for the standard normal distribution. Show that if $x \ge 0$, then the
cumulative distribution function,

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt,$$

can be represented as the sum of the series

$$\Phi(x) = \frac{1}{2} + \frac{x}{\sqrt{2\pi}}\left[1 - \frac{x^2}{6} + \frac{x^4}{40} - \frac{x^6}{336} + \frac{x^8}{3456} - \cdots + (-1)^n\frac{x^{2n}}{(2n + 1)2^n n!} + \cdots\right].$$

[Note: By truncating this series, we can obtain a polynomial approximation
of $\Phi(x)$. For example, we have the following approximation of
order 11:

$$\Phi(x) \approx \frac{1}{2} + \frac{x}{\sqrt{2\pi}}\left[1 - \frac{x^2}{6} + \frac{x^4}{40} - \frac{x^6}{336} + \frac{x^8}{3456} - \frac{x^{10}}{42240}\right].]$$


12.13. Use the result in Exercise 12.12 to show that there exist constants
$a, b, c, d$, such that

$$\Phi(x) \approx \frac{1}{2} + \frac{x}{\sqrt{2\pi}}\,\frac{1 + ax^2 + bx^4}{1 + cx^2 + dx^4}.$$


12.14. Apply the method of importance sampling to approximate the value of
the integral $\int_0^2 e^x\,dx$ using a sample of size $n = 150$ from the distribution
whose density function is $g(x) = \frac{1}{4}(1 + x)$, $0 \le x \le 2$. Compare
your answer with the one you would get from applying formula
(12.36).
12.15. Consider the density function

$$f(x) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\left(\frac{n}{2}\right)}\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2}$$

for a t-distribution with $n$ degrees of freedom (see Exercise 6.28). Let
$F(x) = \int_{-\infty}^{x} f(t)\,dt$ be its cumulative distribution function. Show that
for large $n$,

$$F(x) \approx \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt,$$

which is the cumulative distribution function for the standard normal.
[Hint: Apply Stirling's formula.]


APPENDIX 


Solutions to Selected Exercises 
1.3. $A \cup C \subset B$ implies $A \subset B$. Hence, $A = A \cap B$. Thus, $A \cap B \subset \bar{C}$ implies
$A \subset \bar{C}$. It follows that $A \cap C = \emptyset$.


1.4. (a) $x \in A \triangle B$ implies that $x \in A$ but $\notin B$, or $x \in B$ but $\notin A$; thus
$x \in A \cup B - A \cap B$. Vice versa, if $x \in A \cup B - A \cap B$, then
$x \in A \triangle B$.

(c) $x \in A \cap (B \triangle D)$ implies that $x \in A$ and $x \in B \triangle D$, so that either
$x \in A \cap B$ but $\notin A \cap D$, or $x \in A \cap D$ but $\notin A \cap B$, so that
$x \in (A \cap B) \triangle (A \cap D)$. Vice versa, if $x \in (A \cap B) \triangle (A \cap D)$, then either
$x \in A \cap B$ but $\notin A \cap D$, or $x \in A \cap D$ but $\notin A \cap B$, so that either
$x$ is in $A$ and $B$ but $\notin D$, or $x$ is in $A$ and $D$ but $\notin B$; thus
$x \in A \cap (B \triangle D)$.


1.5. It is obvious that $\rho$ is reflexive, symmetric, and transitive. Hence, it is
an equivalence relation. If $(m_0, n_0)$ is an element in $A$, then its
equivalence class is the set of all pairs $(m, n)$ in $A$ such that $m/m_0 = n/n_0$,
that is, $m/n = m_0/n_0$.


1.6. The equivalence class of (1, 2) consists of all pairs $(m, n)$ in $A$ such that
$m - n = -1$.


1.7. The first elements in all four pairs are distinct.


1.8. (a) If $y \in f(\bigcup_i A_i)$, then $y = f(x)$, where $x \in \bigcup_i A_i$. Hence, if
$x \in A_i$ and $f(x) \in f(A_i)$ for some $i$, then $f(x) \in \bigcup_i f(A_i)$; thus
$f(\bigcup_i A_i) \subset \bigcup_i f(A_i)$. Vice versa, it is easy to show that
$\bigcup_i f(A_i) \subset f(\bigcup_i A_i)$.

(b) If $y \in f(\bigcap_i A_i)$, then $y = f(x)$, where $x \in A_i$ for all $i$; then
$f(x) \in f(A_i)$ for all $i$; then $f(x) \in \bigcap_i f(A_i)$. Equality holds if $f$ is
one-to-one.
. Define $f: J^+ \to A$ such that $f(n) = 2n^2 + 1$. Then $f$ is one-to-one and
onto, so $A$ is countable.


1.13. $a + \sqrt{b} = c + \sqrt{d} \Rightarrow a - c = \sqrt{d} - \sqrt{b}$. If $a = c$, then $b = d$. If $a \ne c$, then
$\sqrt{d} - \sqrt{b}$ is a nonzero rational number and $\sqrt{d} + \sqrt{b} = (d - b)/(\sqrt{d} - \sqrt{b})$.
It follows that both $\sqrt{d} - \sqrt{b}$ and $\sqrt{d} + \sqrt{b}$ are rational numbers,
and therefore $\sqrt{b}$ and $\sqrt{d}$ must be rational numbers.


. Let $g = \inf(A)$. Then $g \le x$ for all $x$ in $A$, so $-g \ge -x$, hence, $-g$ is
an upper bound of $-A$ and is the least upper bound: if $-g'$ is any
other upper bound of $-A$, then $-g' \ge -x$, so $x \ge g'$, hence, $g'$ is a
lower bound of $A$, so $g' \le g$, that is, $-g' \ge -g$, so $-g = \sup(-A)$.


. Suppose that $b \notin A$. Since $b$ is the least upper bound of $A$, it must be a
limit point of $A$: for any $\epsilon > 0$, there exists an element $a \in A$ such that
$a > b - \epsilon$. Furthermore, $a < b$, since $b \notin A$. Hence, $b$ is a limit point of
$A$. But $A$ is closed; hence, by Theorem 1.6.4, $b \in A$.


. Suppose that $G$ is a basis for $\mathscr{F}$, and let $p \in B$. Then $B = \bigcup_{\alpha} U_{\alpha}$, where
$U_{\alpha}$ belongs to $G$. Hence, there is at least one $U_{\alpha}$ such that $p \in U_{\alpha} \subset B$.
Vice versa, if for each $B \in \mathscr{F}$ and each $p \in B$ there is a $U \in G$ such
that $p \in U \subset B$, then $G$ must be a basis for $\mathscr{F}$: for each $p \in B$ we can
find a set $U_p \in G$ such that $p \in U_p \subset B$; then $B = \bigcup\{U_p \mid p \in B\}$, so $G$ is
a basis.


. Let $p$ be a limit point of $A \cup B$. Then $p$ is a limit point of $A$ or of $B$.
In either case, $p \in A \cup B$. Hence, by Theorem 1.6.4, $A \cup B$ is a closed
set.


. Let $\{C_{\alpha}\}$ be an open covering of $B$. Then $\bar{B}$ and $\{C_{\alpha}\}$ form an open
covering of $A$. Since $A$ is compact, a finite subcollection of the latter
covering covers $A$. Furthermore, since $\bar{B}$ does not cover $B$, the members
of this finite covering that contain $B$ are all in $\{C_{\alpha}\}$. Hence, $B$ is
compact.


1.20. No. Let $(A, \mathscr{F})$ be a topological space such that $A$ consists of two
points $a$ and $b$ and $\mathscr{F}$ consists of $A$ and the empty set $\emptyset$. Then $A$ is
compact, and the point $a$ is a compact subset; but it is not closed, since
the complement of $a$, namely $b$, is not a member of $\mathscr{F}$, and is therefore
not open.


1.21. (a) Let $\omega \in A$. Then $X(\omega) \le x < x + 3^{-n}$ for all $n$, so $\omega \in \bigcap_{n=1}^{\infty} A_n$.
Vice versa, if $\omega \in \bigcap_{n=1}^{\infty} A_n$, then $X(\omega) < x + 3^{-n}$ for all $n$. To
show that $X(\omega) \le x$: if $X(\omega) > x$, then $X(\omega) > x + 3^{-n}$ for some $n$,
a contradiction. Therefore, $X(\omega) \le x$, and $\omega \in A$.

(b) Let $\omega \in B$. Then $X(\omega) < x$, so $X(\omega) < x - 3^{-n}$ for some $n$, hence,
$\omega \in \bigcup_{n=1}^{\infty} B_n$. Vice versa, if $\omega \in \bigcup_{n=1}^{\infty} B_n$, then $X(\omega) < x - 3^{-n}$ for
some $n$, so $X(\omega) < x$: if $X(\omega) \ge x$, then $X(\omega) > x - 3^{-n}$ for all $n$, a
contradiction. Therefore, $\omega \in B$.


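1.22. (a) $P(X \ge 2) \le \lambda/2$, since $\mu = \lambda$.

(b)

$$P(X \ge 2) = 1 - P(X < 2) = 1 - p(0) - p(1) = 1 - e^{-\lambda}(\lambda + 1).$$

To show that

$$1 - e^{-\lambda}(\lambda + 1) < \frac{\lambda}{2}:$$

Let $\phi(\lambda) = \lambda/2 + e^{-\lambda}(\lambda + 1) - 1$. Then
i. $\phi(0) = 0$, and
ii. $\phi'(\lambda) = \frac{1}{2} - \lambda e^{-\lambda} > 0$ for all $\lambda$.
From (i) and (ii) we conclude that $\phi(\lambda) > 0$ for all $\lambda > 0$.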


1.23. (a)

$$P(|X - \mu| \ge c) = P[(X - \mu)^2 \ge c^2] \le \frac{\sigma^2}{c^2},$$

by Markov's inequality and the fact that $E(X - \mu)^2 = \sigma^2$.

(b) Use $c = k\sigma$ in (a).

(c)

$$P(|X - \mu| < k\sigma) = 1 - P(|X - \mu| \ge k\sigma) \ge 1 - \frac{1}{k^2}.$$

1.24. $\mu = E(X) = \int_{-1}^{1} x(1 - |x|)\,dx = 0$, $\sigma^2 = \int_{-1}^{1} x^2(1 - |x|)\,dx = \frac{1}{6}$.
(a) 


(b) 
1.25. (a) True, since $X_{(1)} \ge x$ if and only if $X_i \ge x$ for all $i$.
(b) True, since $X_{(n)} \le x$ if and only if $X_i \le x$ for all $i$.
(c)

$$P(X_{(1)} \le x) = 1 - P(X_{(1)} > x) = 1 - [1 - F(x)]^n.$$

(d) $P(X_{(n)} \le x) = [F(x)]^n$.
1.26. PQ < Xa <3) =P(Xq <3) — P(X <2). Hence, 
P(2<Xq <3) =1-[1-F(3) -1+ [1 -F2 
=F =r: 
But F(x) = [i2e~?! dt =1—e-?*. Hence, 


P(2<Xq) <3) = (et) - (e6) 
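2.1. If $m > n$, then the rank of the $n \times m$ matrix $\mathbf{U} = [\mathbf{u}_1 : \mathbf{u}_2 : \cdots : \mathbf{u}_m]$ is less
than or equal to $n$. Hence, the number of linearly independent columns
of $\mathbf{U}$ is less than or equal to $n$, so the $m$ columns of $\mathbf{U}$ must be linearly
dependent.


2.2. If $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ and $\mathbf{v}$ are linearly dependent, then $\mathbf{v}$ must belong to $W$,
a contradiction.


2.5. Suppose that $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_n$ are linearly independent in $U$, and that
$\sum_{i=1}^{n} \alpha_i T(\mathbf{u}_i) = \mathbf{0}$ for some constants $\alpha_1, \alpha_2, \ldots, \alpha_n$. Then $T(\sum_{i=1}^{n} \alpha_i\mathbf{u}_i) = \mathbf{0}$.
If $T$ is one-to-one, then $\sum_{i=1}^{n} \alpha_i\mathbf{u}_i = \mathbf{0}$, hence, $\alpha_i = 0$ for
all $i$, since the $\mathbf{u}_i$'s are linearly independent. It follows that $T(\mathbf{u}_1),
T(\mathbf{u}_2), \ldots, T(\mathbf{u}_n)$ must also be linearly independent. Vice versa, let
$\mathbf{u} \in U$ such that $T(\mathbf{u}) = \mathbf{0}$. Let $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n$ be a basis for $U$. Then
$\mathbf{u} = \sum_{i=1}^{n} \gamma_i\mathbf{e}_i$ for some constants $\gamma_1, \gamma_2, \ldots, \gamma_n$. $T(\mathbf{u}) = \mathbf{0} \Rightarrow \sum_{i=1}^{n} \gamma_i T(\mathbf{e}_i) = \mathbf{0}
\Rightarrow \gamma_i = 0$ for all $i$, since $T(\mathbf{e}_1), T(\mathbf{e}_2), \ldots, T(\mathbf{e}_n)$ are linearly independent.
It follows that $\mathbf{u} = \mathbf{0}$ and $T$ is one-to-one.


2.7. If $\mathbf{A} = (a_{ij})$, then $\mathrm{tr}(\mathbf{A}\mathbf{A}') = \sum_i\sum_j a_{ij}^2$. Hence, $\mathbf{A} = \mathbf{0}$ if and only if
$\mathrm{tr}(\mathbf{A}\mathbf{A}') = 0$.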
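2.8. It is sufficient to show that $\mathbf{A}\mathbf{v} = \mathbf{0}$ if $\mathbf{v}'\mathbf{A}\mathbf{v} = 0$. If $\mathbf{v}'\mathbf{A}\mathbf{v} = 0$, then $\mathbf{A}^{1/2}\mathbf{v} = \mathbf{0}$,
and hence $\mathbf{A}\mathbf{v} = \mathbf{0}$. [Note: $\mathbf{A}^{1/2}$ is defined as follows: since $\mathbf{A}$ is symmetric,
it can be written as $\mathbf{A} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}'$ by Theorem 2.3.10, where $\boldsymbol{\Lambda}$ is a
diagonal matrix whose diagonal elements are the eigenvalues of $\mathbf{A}$, and
$\mathbf{P}$ is an orthogonal matrix. Furthermore, since $\mathbf{A}$ is positive semidefinite,
its eigenvalues are greater than or equal to zero by Theorem 2.3.13.
The matrix $\mathbf{A}^{1/2}$ is defined as $\mathbf{P}\boldsymbol{\Lambda}^{1/2}\mathbf{P}'$, where the diagonal elements of
$\boldsymbol{\Lambda}^{1/2}$ are the square roots of the corresponding diagonal elements
of $\boldsymbol{\Lambda}$.]


2.9. By Theorem 2.3.15, there exists an orthogonal matrix $\mathbf{P}$ and diagonal
matrices $\boldsymbol{\Lambda}_1$, $\boldsymbol{\Lambda}_2$ such that $\mathbf{A} = \mathbf{P}\boldsymbol{\Lambda}_1\mathbf{P}'$, $\mathbf{B} = \mathbf{P}\boldsymbol{\Lambda}_2\mathbf{P}'$. The diagonal elements
of $\boldsymbol{\Lambda}_1$ and $\boldsymbol{\Lambda}_2$ are nonnegative. Hence,

$$\mathbf{A}\mathbf{B} = \mathbf{P}\boldsymbol{\Lambda}_1\boldsymbol{\Lambda}_2\mathbf{P}'$$

is positive semidefinite, since the diagonal elements of $\boldsymbol{\Lambda}_1\boldsymbol{\Lambda}_2$ are
nonnegative.


2.10. Let $\mathbf{C} = \mathbf{A}\mathbf{B}$. Then $\mathbf{C}' = -\mathbf{B}\mathbf{A}$. Since $\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A})$, we have $\mathrm{tr}(\mathbf{C}) =
\mathrm{tr}(-\mathbf{C}') = -\mathrm{tr}(\mathbf{C})$ and thus $\mathrm{tr}(\mathbf{C}) = 0$.


2.11. Let $\mathbf{B} = \mathbf{A} - \mathbf{A}'$. Then

$$\mathrm{tr}(\mathbf{B}'\mathbf{B}) = \mathrm{tr}[(\mathbf{A}' - \mathbf{A})(\mathbf{A} - \mathbf{A}')]$$
$$= \mathrm{tr}[(\mathbf{A}' - \mathbf{A})\mathbf{A}] - \mathrm{tr}[(\mathbf{A}' - \mathbf{A})\mathbf{A}']$$
$$= \mathrm{tr}[(\mathbf{A}' - \mathbf{A})\mathbf{A}] - \mathrm{tr}[\mathbf{A}(\mathbf{A} - \mathbf{A}')]$$
$$= \mathrm{tr}[(\mathbf{A}' - \mathbf{A})\mathbf{A}] - \mathrm{tr}[(\mathbf{A} - \mathbf{A}')\mathbf{A}]$$
$$= 2\,\mathrm{tr}[(\mathbf{A}' - \mathbf{A})\mathbf{A}] = 0,$$

since $\mathbf{A}' - \mathbf{A}$ is skew-symmetric. Hence, by Exercise 2.7, $\mathbf{B} = \mathbf{0}$ and thus
$\mathbf{A} = \mathbf{A}'$.


2.12. This follows easily from Theorem 2.3.9.


2.13. By Theorem 2.3.10, we can write

$$\mathbf{A} - \lambda\mathbf{I}_n = \mathbf{P}(\boldsymbol{\Lambda} - \lambda\mathbf{I}_n)\mathbf{P}',$$

where $\mathbf{P}$ is an orthogonal matrix and $\boldsymbol{\Lambda}$ is diagonal with diagonal
elements equal to the eigenvalues of $\mathbf{A}$. The diagonal elements of
$\boldsymbol{\Lambda} - \lambda\mathbf{I}_n$ are the eigenvalues of $\mathbf{A} - \lambda\mathbf{I}_n$. Now, $k$ diagonal elements of
$\boldsymbol{\Lambda} - \lambda\mathbf{I}_n$ are equal to zero, and the remaining $n - k$ elements must
be different from zero. Hence, $\mathbf{A} - \lambda\mathbf{I}_n$ has rank $n - k$.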
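2.17. If $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A} = \mathbf{B}$, then $(\mathbf{A} - \mathbf{B})^2 = \mathbf{A}^2 - \mathbf{A}\mathbf{B} - \mathbf{B}\mathbf{A} + \mathbf{B}^2 = \mathbf{A} - 2\mathbf{B} + \mathbf{B} =
\mathbf{A} - \mathbf{B}$. Vice versa, if $\mathbf{A} - \mathbf{B}$ is idempotent, then $\mathbf{A} - \mathbf{B} = (\mathbf{A} - \mathbf{B})^2 =
\mathbf{A}^2 - \mathbf{A}\mathbf{B} - \mathbf{B}\mathbf{A} + \mathbf{B}^2 = \mathbf{A} - \mathbf{A}\mathbf{B} - \mathbf{B}\mathbf{A} + \mathbf{B}$. Thus,

$$\mathbf{A}\mathbf{B} + \mathbf{B}\mathbf{A} = 2\mathbf{B},$$

so $\mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{B}\mathbf{A} = 2\mathbf{A}\mathbf{B}$ and thus $\mathbf{A}\mathbf{B} = \mathbf{A}\mathbf{B}\mathbf{A}$. We also have
$\mathbf{A}\mathbf{B}\mathbf{A} + \mathbf{B}\mathbf{A} = 2\mathbf{B}\mathbf{A}$, hence, $\mathbf{B}\mathbf{A} = \mathbf{A}\mathbf{B}\mathbf{A}$.
It follows that $\mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A}$. We finally conclude that
$\mathbf{B} = \mathbf{A}\mathbf{B} = \mathbf{B}\mathbf{A}$.


2.18. If $\mathbf{A}$ is an $n \times n$ orthogonal matrix with determinant 1, then its eigenvalues
are of the form $e^{\pm i\phi_1}, e^{\pm i\phi_2}, \ldots, e^{\pm i\phi_q}, 1$, where 1 is of multiplicity
$n - 2q$ and none of the real numbers $\phi_j$ is a multiple of $2\pi$
($j = 1, 2, \ldots, q$) (those that are odd multiples of $\pi$ give an even number
of eigenvalues equal to $-1$).


2.19. Using Theorem 2.3.10, it is easy to show that

$$e_{\min}(\mathbf{A})\mathbf{I}_n \le \mathbf{A} \le e_{\max}(\mathbf{A})\mathbf{I}_n,$$

where the inequality on the right means that $e_{\max}(\mathbf{A})\mathbf{I}_n - \mathbf{A}$ is positive
semidefinite, and the one on the left means that $\mathbf{A} - e_{\min}(\mathbf{A})\mathbf{I}_n$ is
positive semidefinite. It follows that

$$e_{\min}(\mathbf{A})\mathbf{L}'\mathbf{L} \le \mathbf{L}'\mathbf{A}\mathbf{L} \le e_{\max}(\mathbf{A})\mathbf{L}'\mathbf{L}.$$

Hence, by Theorem 2.3.19(1),

$$e_{\min}(\mathbf{A})\,\mathrm{tr}(\mathbf{L}'\mathbf{L}) \le \mathrm{tr}(\mathbf{L}'\mathbf{A}\mathbf{L}) \le e_{\max}(\mathbf{A})\,\mathrm{tr}(\mathbf{L}'\mathbf{L}).$$


2.20. (a) We have that $\mathbf{A} \ge e_{\min}(\mathbf{A})\mathbf{I}_n$ by Theorem 2.3.10. Hence, $\mathbf{L}'\mathbf{A}\mathbf{L} \ge
e_{\min}(\mathbf{A})\mathbf{L}'\mathbf{L}$, and therefore, $e_{\min}(\mathbf{L}'\mathbf{A}\mathbf{L}) \ge e_{\min}(\mathbf{A})e_{\min}(\mathbf{L}'\mathbf{L})$ by Theorem
2.3.18 and the fact that $e_{\min}(\mathbf{A}) \ge 0$ and $\mathbf{L}'\mathbf{L}$ is nonnegative definite.
(b) This is similar to (a).


2.21. We have that

$$e_{\min}(\mathbf{B})\mathbf{A} \le \mathbf{A}^{1/2}\mathbf{B}\mathbf{A}^{1/2} \le e_{\max}(\mathbf{B})\mathbf{A},$$

since $\mathbf{A}$ is nonnegative definite. Hence,

$$e_{\min}(\mathbf{B})\,\mathrm{tr}(\mathbf{A}) \le \mathrm{tr}(\mathbf{A}^{1/2}\mathbf{B}\mathbf{A}^{1/2}) \le e_{\max}(\mathbf{B})\,\mathrm{tr}(\mathbf{A}).$$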


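The result follows by noting that
$\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{A}^{1/2}\mathbf{B}\mathbf{A}^{1/2})$.


2.23. (a) $\mathbf{A} = \mathbf{I}_n - (1/n)\mathbf{J}_n$, where $\mathbf{J}_n$ is the matrix of ones of order $n \times n$.
$\mathbf{A}^2 = \mathbf{A}$, since $\mathbf{J}_n^2 = n\mathbf{J}_n$. The rank of $\mathbf{A}$ is the same as its trace, which
is equal to $n - 1$.

(b) $(n - 1)s^2/\sigma^2 = (1/\sigma^2)\mathbf{y}'\mathbf{A}\mathbf{y} \sim \chi^2_{n-1}$, since $(1/\sigma^2)\mathbf{A}(\sigma^2\mathbf{I}_n)$ is idempotent
of rank $n - 1$, and the noncentrality parameter is zero.

(c)

$$\mathrm{Cov}(\bar{y}, \mathbf{A}\mathbf{y}) = \mathrm{Cov}\left(\frac{1}{n}\mathbf{1}_n'\mathbf{y}, \mathbf{A}\mathbf{y}\right) = \frac{1}{n}\mathbf{1}_n'(\sigma^2\mathbf{I}_n)\mathbf{A} = \frac{\sigma^2}{n}\mathbf{1}_n'\mathbf{A} = \mathbf{0}'.$$

Since $\mathbf{y}$ is normally distributed, both $\bar{y}$ and $\mathbf{A}\mathbf{y}$, being linear transformations
of $\mathbf{y}$, are also normally distributed. Hence, they must be
independent, since they are uncorrelated. We can similarly show
that $\bar{y}$ and $\mathbf{A}^{1/2}\mathbf{y}$ are uncorrelated, and hence independent, by the
fact that $\mathbf{1}_n'\mathbf{A} = \mathbf{0}'$ and thus $\mathbf{1}_n'\mathbf{A}\mathbf{1}_n = 0$, which means $\mathbf{1}_n'\mathbf{A}^{1/2} = \mathbf{0}'$. It
follows that $\bar{y}$ and $\mathbf{y}'\mathbf{A}^{1/2}\mathbf{A}^{1/2}\mathbf{y} = \mathbf{y}'\mathbf{A}\mathbf{y}$ are independent.


2.24. (a) The model is written as

$$\mathbf{y} = \mathbf{X}\mathbf{g} + \boldsymbol{\epsilon},$$

where $\mathbf{X} = [\mathbf{1}_N : \oplus_{i=1}^{a}\mathbf{1}_{n_i}]$, $\oplus$ denotes the direct sum of matrices,
$\mathbf{g} = (\mu, \alpha_1, \alpha_2, \ldots, \alpha_a)'$, and $N = \sum_{i=1}^{a} n_i$. Now, $\mu + \alpha_i$ is estimable,
since it can be written as $\mathbf{a}_i'\mathbf{g}$, where $\mathbf{a}_i'$ is a row of $\mathbf{X}$, $i = 1, 2, \ldots, a$.
It follows that $\alpha_i - \alpha_{i'} = (\mathbf{a}_i' - \mathbf{a}_{i'}')\mathbf{g}$ is estimable, since the vector
$\mathbf{a}_i' - \mathbf{a}_{i'}'$ belongs to the row space of $\mathbf{X}$.

(b) Suppose that $\mu$ is estimable; then it can be written as

$$\mu = \sum_{i=1}^{a} \lambda_i(\mu + \alpha_i),$$

where $\lambda_1, \lambda_2, \ldots, \lambda_a$ are constants, since $\mu + \alpha_1, \mu + \alpha_2, \ldots, \mu + \alpha_a$
form a basic set of estimable linear functions. We must therefore
have $\sum_{i=1}^{a} \lambda_i = 1$, and $\lambda_i = 0$ for all $i$, a contradiction. Therefore, $\mu$ is
nonestimable.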
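2.25. (a) $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ by Theorem 2.3.3(2).
(b) $E(\boldsymbol{\ell}'\mathbf{y}) = \boldsymbol{\lambda}'\boldsymbol{\beta} \Rightarrow \boldsymbol{\ell}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\lambda}'\boldsymbol{\beta}$. Hence, $\boldsymbol{\lambda}' = \boldsymbol{\ell}'\mathbf{X}$. Now,

$$\mathrm{Var}(\boldsymbol{\lambda}'\hat{\boldsymbol{\beta}}) = \boldsymbol{\lambda}'(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\boldsymbol{\lambda}\sigma^2$$
$$= \boldsymbol{\ell}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\ell}\sigma^2$$
$$= \boldsymbol{\ell}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'\boldsymbol{\ell}\sigma^2, \quad \text{by Theorem 2.3.3(2)}$$
$$\le \boldsymbol{\ell}'\boldsymbol{\ell}\sigma^2,$$

since $\mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-}\mathbf{X}'$ is positive semidefinite. The result follows
from the last inequality, since $\mathrm{Var}(\boldsymbol{\ell}'\mathbf{y}) = \boldsymbol{\ell}'\boldsymbol{\ell}\sigma^2$.


2.26. $\hat{\boldsymbol{\beta}}^* = \mathbf{P}(\boldsymbol{\Lambda} + k\mathbf{I}_p)^{-1}\mathbf{P}'\mathbf{X}'\mathbf{y}$. But $\hat{\boldsymbol{\beta}} = \mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}'\mathbf{X}'\mathbf{y}$, so $\boldsymbol{\Lambda}\mathbf{P}'\hat{\boldsymbol{\beta}} = \mathbf{P}'\mathbf{X}'\mathbf{y}$.
Hence,

$$\hat{\boldsymbol{\beta}}^* = \mathbf{P}(\boldsymbol{\Lambda} + k\mathbf{I}_p)^{-1}\boldsymbol{\Lambda}\mathbf{P}'\hat{\boldsymbol{\beta}} = \mathbf{P}\mathbf{D}\mathbf{P}'\hat{\boldsymbol{\beta}}.$$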


2.27. 
sup p* = sup (v'C;!AC!3)” 


sup (7'bb'z)} 

= sup {e max(bb')}, by applying (2.9) 
= sup {ema (b'b)} 

= sup {v’C; ‘AC; 7A CI ' v} 

=€ max (CI 'ACZ’ACT!), by (2.9) 
= e max (CI “AC; 7A’) 


= e max (By ‘AB; 'A’). 
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CHAPTER 3 


3.1. (a)

$$\lim_{x \to 1}\frac{x^5 - 1}{x - 1} = \lim_{x \to 1}(1 + x + x^2 + x^3 + x^4) = 5.$$

(b) $|x\sin(1/x)| \le |x| \Rightarrow \lim_{x \to 0} x\sin(1/x) = 0$.

(c) The limit does not exist, since the function is equal to 1 except
when $\sin(1/x) = 0$, that is, when $x = \mp 1/\pi, \mp 1/2\pi, \ldots, \mp 1/n\pi, \ldots$.
For such values, the function takes the form 0/0, and is
therefore undefined. To have a limit at $x = 0$, the function must be
defined at all points in a deleted neighborhood of $x = 0$.

(d) lim, ,9- f(x) =4 and lim, „o+ f(x) =1. Therefore, the function 
f(x) does not have a limit as x > 0. 


3.2. (a) To show that (tan x*)/x? >0 as x> 0: (tan x?)/x? = 
(1/cos x*\(sin x?)/x?. But |sin x7|<|x3|. Hence, |(tan x°) /x?|< 
|x| /|cos x3] > 0 as x > 0. 
(b) x/ Vx >0 as x 30. 
(c) O(1) is bounded as x > ~. Therefore, O(1)/x > 0 as x >™, so 


O(1) = o(x) as x > &, 
1 1 
Pa 


(d) 
f(x) g(x) = [x +0(x°)] 
1 1 
Ea o(x’°) +0(x7)0( -}. 


1 1 
=~ +x0(- 
x x 


Now, by definition, |xO(1/x)| is bounded by a positive constant as 
x20, and o(x?)/x? >0 as x > 0. Hence, o(x?)/x? is bounded, 
that is, O(1). Furthermore, o(x*)O(1/x) = x[o(x’)/x?]xO( /x), 
which goes to zero as x — 0, is also bounded. It follows that 


1 
f(x)g(x) = = +001). 


3.3. (a) f(0) = 0 and $\lim_{x\to 0} f(x) = 0$. Hence, f(x) is continuous at x = 0. It
is also continuous at all other values of x.

(b) The function f(x) is defined on [1,2], but $\lim_{x\to 2^-} f(x) = \infty$, so f(x)
is not continuous at x = 2.

(c) If n is odd, then f(x) is continuous everywhere, except at x = 0. If
n is even, f(x) will be continuous only when x > 0. (m and n must
be expressed in their lowest terms so that the only common divisor
is 1.)

(d) f(x) is continuous for $x \ne 1$.


3.6.

reo= {yr 320


PA


Thus f(x) is continuous everywhere except at x = 0.


3.7. $\lim_{x\to 0^-} f(x) = 2$ and $\lim_{x\to 0^+} f(x) = 0$. Therefore, f(x) cannot be made
continuous at x = 0.


3.8. Letting a = b = 0 in f(a + b) = f(a) + f(b), we conclude that f(0) = 0.
Now, for any $x_1, x_2 \in R$,

$$f(x_1) = f(x_2) + f(x_1 - x_2).$$

Let $z = x_1 - x_2$. Then

$$|f(x_1) - f(x_2)| = |f(z)| = |f(z) - f(0)|.$$

Since f(x) is continuous at x = 0, for a given $\epsilon > 0$ there exists a $\delta > 0$
such that $|f(z) - f(0)| < \epsilon$ if $|z| < \delta$. Hence, $|f(x_1) - f(x_2)| < \epsilon$ for
all $x_1, x_2$ such that $|x_1 - x_2| < \delta$. Thus f(x) is uniformly continuous
everywhere in R.


3.9. $\lim_{x\to 1^-} f(x) = \lim_{x\to 1^+} f(x) = f(1) = 1$, so f(x) is continuous at x = 1.
It is also continuous everywhere else on [0,2], which is closed and
bounded. Therefore, it is uniformly continuous by Theorem 3.4.6.


3.10.

$$|\cos x_1 - \cos x_2| = \left|2\sin\frac{x_2 - x_1}{2}\,\sin\frac{x_2 + x_1}{2}\right|
\le 2\left|\sin\frac{x_2 - x_1}{2}\right| \le |x_1 - x_2| \quad \text{for all } x_1, x_2 \text{ in } R.$$

Thus, for a given $\epsilon > 0$, there exists a $\delta > 0$ (namely, $\delta < \epsilon$) such that
$|\cos x_1 - \cos x_2| < \epsilon$ whenever $|x_1 - x_2| < \delta$.

3.13. f(x) = 0 if x is a rational number in [a, b]. If x is an irrational number,
then it is a limit of a sequence $\{y_n\}_{n=1}^{\infty}$ of rational numbers (any
neighborhood of x contains infinitely many rationals). Hence,

$$f(x) = f\left(\lim_{n\to\infty} y_n\right) = \lim_{n\to\infty} f(y_n) = 0,$$

since f(x) is a continuous function. Thus f(x) = 0 for every x in [a, b].


3.14. f(x) can be written as 


3.15. 


3.16. 


3.17. 


3.18. 


F= KH 1; 
f(x) =(5, -1<x<1, 


34+2x, x21. 


Hence, f(x) has a unique inverse for x < —1 and for x > 1. 
The inverse function is 


y, y<l, 
=f! = 
ERAN omer y>1. 


(a) The inverse of f(x) is f'y) =2 — yy/2. 
(b) The inverse of f(x) is f-'(y)=2+ yy/2. 


By Theorem 3.6.1,

$$\sum_{i=1}^{n} |f(a_i) - f(b_i)| \le K\sum_{i=1}^{n} |a_i - b_i| < K\delta.$$

Choosing $\delta$ such that $\delta < \epsilon/K$, we get

$$\sum_{i=1}^{n} |f(a_i) - f(b_i)| < \epsilon,$$

for any given $\epsilon > 0$.


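This inequality can be proved by using mathematical induction: it is
obviously true for n = 1. Suppose that it is true for n = m. To show that
it is true for n = m + 1: for n = m, we have

$$f\left(\frac{\sum_{i=1}^{m} a_i x_i}{A_m}\right) \le \frac{\sum_{i=1}^{m} a_i f(x_i)}{A_m},$$

where $A_m = \sum_{i=1}^{m} a_i$. Let

$$b_m = \frac{\sum_{i=1}^{m} a_i x_i}{A_m}.$$

Then

$$f\left(\frac{\sum_{i=1}^{m+1} a_i x_i}{A_{m+1}}\right) = f\left(\frac{A_m b_m + a_{m+1}x_{m+1}}{A_{m+1}}\right)
\le \frac{A_m}{A_{m+1}} f(b_m) + \frac{a_{m+1}}{A_{m+1}} f(x_{m+1}).$$

But $f(b_m) \le (1/A_m)\sum_{i=1}^{m} a_i f(x_i)$. Hence,

$$f\left(\frac{\sum_{i=1}^{m+1} a_i x_i}{A_{m+1}}\right) \le \frac{\sum_{i=1}^{m} a_i f(x_i)}{A_{m+1}} + \frac{a_{m+1}}{A_{m+1}} f(x_{m+1})
= \frac{\sum_{i=1}^{m+1} a_i f(x_i)}{A_{m+1}}.$$


3.19. Let a be a limit point of S. There exists a sequence $\{a_n\}_{n=1}^{\infty}$ in S such
that $\lim_{n\to\infty} a_n = a$ (if S is finite, then S is closed already). Hence,
$f(a) = f(\lim_{n\to\infty} a_n) = \lim_{n\to\infty} f(a_n) = 0$. It follows that $a \in S$, and S is
therefore a closed set.


3.20. Let $g(x) = \exp[f(x)]$. We have that

$$f[\lambda x_1 + (1 - \lambda)x_2] \le \lambda f(x_1) + (1 - \lambda) f(x_2)$$

for all $x_1, x_2$ in D, $0 \le \lambda \le 1$. Hence,

$$g[\lambda x_1 + (1 - \lambda)x_2] \le \exp[\lambda f(x_1) + (1 - \lambda) f(x_2)] \le \lambda g(x_1) + (1 - \lambda) g(x_2),$$

since the function $e^x$ is convex. Hence, g(x) is convex on D.


3.22. (a) We have that $E(|X|) \ge |E(X)|$. Let $X \sim N(\mu, \sigma^2)$. Then

$$E(X) = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right] x\,dx.$$

Therefore,

$$|E(X)| \le \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} \exp\left[-\frac{1}{2\sigma^2}(x - \mu)^2\right] |x|\,dx = E(|X|).$$

(b) We have that $E(e^{-X}) \ge e^{-\mu}$, where $\mu = E(X)$, since $e^{-x}$ is a
convex function. The density function of the exponential distribution
with mean $\mu$ is

$$g(x) = \frac{1}{\mu}e^{-x/\mu}, \quad 0 < x < \infty.$$

Hence,

$$E(e^{-X}) = \frac{1}{\mu}\int_{0}^{\infty} e^{-x/\mu}e^{-x}\,dx = \frac{1}{\mu}\cdot\frac{1}{1/\mu + 1} = \frac{1}{\mu + 1}.$$

But

$$e^{\mu} = 1 + \mu + \frac{1}{2!}\mu^2 + \cdots + \frac{1}{n!}\mu^n + \cdots \ge 1 + \mu.$$

It follows that

$$\frac{1}{\mu + 1} \ge e^{-\mu},$$

and thus

$$E(e^{-X}) \ge e^{-\mu}.$$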


3.23. $P(|X_n^2| \ge \epsilon) = P(|X_n| \ge \epsilon^{1/2}) \to 0$ as $n \to \infty$, since $X_n$ converges in
probability to zero.


3.24.

$$\sigma^2 = E[(X - \mu)^2] = E[|X - \mu|^2] \ge E^2[|X - \mu|],$$

hence,

$$\sigma \ge E[|X - \mu|].$$

CHAPTER 4


4.1.

$$\begin{aligned}
\lim_{h\to 0}\frac{f(h) - f(-h)}{2h} &= \lim_{h\to 0}\frac{f(h) - f(0)}{2h} + \lim_{h\to 0}\frac{f(0) - f(-h)}{2h} \\
&= \frac{f'(0)}{2} + \frac{1}{2}\lim_{h\to 0}\frac{f(-h) - f(0)}{(-h)} \\
&= \frac{f'(0)}{2} + \frac{f'(0)}{2} = f'(0).
\end{aligned}$$

The converse is not true: let f(x) = |x|. Then

$$\frac{f(h) - f(-h)}{2h} = 0,$$

and its limit is zero. But f'(0) does not exist.


4.3. For a given $\epsilon > 0$, there exists a $\delta > 0$ such that for any $x \in N_\delta(x_0)$, $x \ne x_0$,

$$\left|\frac{f(x) - f(x_0)}{x - x_0} - f'(x_0)\right| < \epsilon.$$

Hence,

$$\left|\frac{f(x) - f(x_0)}{x - x_0}\right| < |f'(x_0)| + \epsilon,$$

and thus

$$|f(x) - f(x_0)| < A|x - x_0|,$$

where $A = |f'(x_0)| + \epsilon$.


4.4. $g(x) = f(x + 1) - f(x) = f'(\xi)$, where $\xi$ is between x and x + 1. As
$x \to \infty$, we have $\xi \to \infty$ and $f'(\xi) \to 0$; hence, $g(x) \to 0$ as $x \to \infty$.


4.5. We have that $f'(1) = \lim_{x\to 1^+}(x^3 - 2x + 1)/(x - 1) = \lim_{x\to 1^-}(ax^2 - bx + 1 + 1)/(x - 1)$,
and thus 1 = 2a - b. Furthermore, since f(x) is
continuous at x = 1, we must have a - b + 1 = -1. It follows that
a = 3, b = 5, and

$$f'(x) = \begin{cases} 3x^2 - 2, & x \ge 1, \\ 6x - 5, & x < 1, \end{cases}$$

which is continuous everywhere.


4.6. Let x > 0.
(a) Taylor's theorem gives

$$f(x + 2h) = f(x) + 2hf'(x) + \frac{(2h)^2}{2!}f''(\xi),$$

where $x < \xi < x + 2h$, and h > 0. Hence,

$$f'(x) = \frac{1}{2h}[f(x + 2h) - f(x)] - hf''(\xi),$$

so that $|f'(x)| \le m_0/h + hm_2$.
(b) Since $m_1$ is the least upper bound of $|f'(x)|$,

$$m_1 \le \frac{m_0}{h} + hm_2.$$

Equivalently,

$$m_2h^2 - m_1h + m_0 \ge 0.$$

This inequality is valid for all h > 0 provided that the discriminant

$$\Delta = m_1^2 - 4m_0m_2$$

is less than or equal to zero, that is, $m_1^2 \le 4m_0m_2$.


4.7. No, unless f'(x) is continuous at $x_0$. For example, the function

$$f(x) = \begin{cases} x + 1, & x > 0, \\ 0, & x = 0, \\ x - 1, & x < 0, \end{cases}$$

does not have a derivative at x = 0, but $\lim_{x\to 0} f'(x) = 1$.


4.8. 
yalt Xii yim al Eily y; 
D'(y;) = lim | j | z | 1l | 
ay; a—y; 
— lim [ya] + Bie AT AT Binili] 
ay; ay; 


which does not exist, since $\lim_{a\to y_j} |y_j - a|/(a - y_j)$ does not exist.

$$\frac{d}{dx}\left[\frac{f(x)}{x}\right] = \frac{xf'(x) - f(x)}{x^2}.$$

$xf'(x) - f(x) \to 0$ as $x \to 0$. Hence, by l'Hospital's rule,

$$\lim_{x\to 0}\frac{d}{dx}\left[\frac{f(x)}{x}\right] = \lim_{x\to 0}\frac{f'(x) + xf''(x) - f'(x)}{2x} = \lim_{x\to 0}\frac{1}{2}f''(x) = \frac{1}{2}f''(0).$$


Let $x_1, x_2 \in R$. Then,

$$|f(x_1) - f(x_2)| = |(x_1 - x_2)f'(\xi)|,$$

where $\xi$ is between $x_1$ and $x_2$. Since f'(x) is bounded for all x,
$|f'(\xi)| < M$ for some positive constant M. Hence, for a given $\epsilon > 0$,
there exists a $\delta > 0$, where $M\delta < \epsilon$, such that $|f(x_1) - f(x_2)| < \epsilon$ if
$|x_1 - x_2| < \delta$. Thus f(x) is uniformly continuous on R.


4.11. $f'(x) = 1 + cg'(x)$, and $|cg'(x)| < cM$. Choose c such that $cM < \frac{1}{2}$. In
this case,

$$|cg'(x)| < \tfrac{1}{2},$$

so

$$-\tfrac{1}{2} < cg'(x) < \tfrac{1}{2},$$

and,

$$\tfrac{1}{2} < f'(x) < \tfrac{3}{2}.$$

Hence, f'(x) is positive and f(x) is therefore strictly monotone increasing;
thus f(x) is one-to-one.


4.12. It is sufficient to show that $g'(x) \ge 0$ on $(0, \infty)$:

$$g'(x) = \frac{xf'(x) - f(x)}{x^2}, \quad x > 0.$$

By the mean value theorem,

$$f(x) = f(0) + xf'(c) = xf'(c), \quad 0 < c < x.$$

Hence,

$$g'(x) = \frac{xf'(x) - xf'(c)}{x^2} = \frac{f'(x) - f'(c)}{x}, \quad x > 0.$$

Since f'(x) is monotone increasing, we have for all x > 0,
$f'(x) \ge f'(c)$ and thus $g'(x) \ge 0$.
4.14. Let $y = (1 + 1/x)^x$. Then,

$$\log y = x\log\left(1 + \frac{1}{x}\right) = \frac{\log(1 + 1/x)}{1/x}.$$

Applying l'Hospital's rule, we get

$$\lim_{x\to\infty}\log y = \lim_{x\to\infty}\frac{-\frac{1}{x^2}\left(1 + \frac{1}{x}\right)^{-1}}{-\frac{1}{x^2}} = 1.$$

Therefore, $\lim_{x\to\infty} y = e$.


4.15. (a) $\lim_{x\to 0^+} f(x) = 1$, where $f(x) = (\sin x)^x$.
(b) $\lim_{x\to 0^+} g(x) = 0$, where $g(x) = e^{-1/x}/x$.


4.16. Let $f(x) = [1 + ax + o(x)]^{1/x}$, and let $y = ax + o(x)$. Then $y = x[a + o(1)]$,
and $[1 + ax + o(x)]^{1/x} = (1 + y)^{a/y}(1 + y)^{o(1)/y}$. Now, as $x \to 0$, we
have $y \to 0$, $(1 + y)^{a/y} \to e^a$, and $(1 + y)^{o(1)/y} \to 1$, since

$$\frac{o(1)}{y}\log(1 + y) \to 0 \quad \text{as } y \to 0.$$

It follows that $f(x) \to e^a$.
4.17. No, because both f'(x) and g'(x) vanish at x = 0.541 in (0, 1).
4.18. Let $g(x) = f(x) - \gamma(x - a)$. Then

$$g'(x) = f'(x) - \gamma,$$
$$g'(a) = f'(a) - \gamma < 0,$$
$$g'(b) = f'(b) - \gamma > 0.$$

The function g(x) is continuous on [a, b]. Therefore, it must achieve its
absolute minimum at some point $\xi$ in [a, b]. This point cannot be a or
b, since $g'(a) < 0$ and $g'(b) > 0$. Hence, $a < \xi < b$, and $g'(\xi) = 0$, so
$f'(\xi) = \gamma$.


Define $g(x) = f(x) - \tau(x - a)$, where

$$\tau = \frac{1}{n}\sum_{i=1}^{n} f'(x_i).$$

There exist $c_1, c_2$ such that

$$\max_i f'(x_i) = f'(c_2), \quad a < c_2 < b,$$
$$\min_i f'(x_i) = f'(c_1), \quad a < c_1 < b.$$

If $f'(x_1) = f'(x_2) = \cdots = f'(x_n)$, then the result is obviously true. Let us
therefore assume that these n derivatives are not all equal. In this case,

$$f'(c_1) < \tau < f'(c_2).$$

Apply now the result in Exercise 4.18 to conclude that there exists a
point c between $c_1$ and $c_2$ such that $f'(c) = \tau$.


We have that

$$\sum_{i=1}^{n}[f(y_i) - f(x_i)] = \sum_{i=1}^{n} f'(c_i)(y_i - x_i),$$

where $c_i$ is between $x_i$ and $y_i$ (i = 1, 2, ..., n). Using Exercise 4.19,
there exists a point c in (a, b) such that

$$\frac{\sum_{i=1}^{n}(y_i - x_i)f'(c_i)}{\sum_{i=1}^{n}(y_i - x_i)} = f'(c).$$

Hence,

$$\sum_{i=1}^{n}[f(y_i) - f(x_i)] = f'(c)\sum_{i=1}^{n}(y_i - x_i).$$


4.21. $\log(1 + x) = \sum_{n=1}^{\infty}(-1)^{n-1}x^n/n$, $|x| < 1$.


4.23. f(x) has an absolute minimum at x = ż.

4.24. The function f(x) is bounded if $x^2 + ax + b \ne 0$ for all x in [-1, 1]. Let
$\Delta = a^2 - 4b$. If $\Delta < 0$, then $x^2 + ax + b > 0$ for all x. The denominator
has an absolute minimum at x = -a/2. Thus, if $-1 \le -a/2 \le 1$, then
f(x) will have an absolute maximum at x = -a/2. Otherwise, f(x)
attains its absolute maximum at x = -1 or x = 1. If $\Delta = 0$, then

$$f(x) = \frac{1}{\left(x + \frac{a}{2}\right)^2}.$$

In this case, the point -a/2 must fall outside [-1, 1], and the absolute
maximum of f(x) is attained at x = -1 or x = 1. Finally, if $\Delta > 0$, then

$$f(x) = \frac{1}{(x - x_1)(x - x_2)},$$

where $x_1 = \frac{1}{2}(-a - \sqrt{\Delta})$, $x_2 = \frac{1}{2}(-a + \sqrt{\Delta})$. Both $x_1$ and $x_2$ must fall
outside [-1, 1]. In this case, the point x = -a/2 is equal to $\frac{1}{2}(x_1 + x_2)$,
which falls outside [-1, 1]. Thus f(x) attains its absolute maximum at
x = -1 or x = 1.


4.25. Let H(y) denote the cumulative distribution function of $G^{-1}[F(Y)]$.
Then

$$\begin{aligned}
H(y) &= P\{G^{-1}[F(Y)] \le y\} \\
&= P[F(Y) \le G(y)] \\
&= P\{Y \le F^{-1}[G(y)]\} \\
&= F\{F^{-1}[G(y)]\} \\
&= G(y).
\end{aligned}$$


4.26. Let g(w) be the density function of W. Then $g(w) = 2we^{-w^2}$, $w \ge 0$.


4.27. (a) Let g(w) be the density function of W. Then


exp| — (w'3-1)'|,  w#0. 


1 
ew) = za 08a 0.08 


(b) Exact mean is E(W) = 1.12. Exact variance is Var(W) = 0.42.
(c) E(w) = 1, Var(w) = 0.36.

4.28. Let G(y) be the cumulative distribution function of Y. Then

$$\begin{aligned}
G(y) &= P(Y \le y) \\
&= P(Z^2 \le y) \\
&= P(|Z| \le y^{1/2}) \\
&= P(-y^{1/2} \le Z \le y^{1/2}) \\
&= 2F(y^{1/2}) - 1,
\end{aligned}$$

where $F(\cdot)$ is the cumulative distribution function of Z. Thus the
density function of Y is $g(y) = (1/\sqrt{2\pi y})\,e^{-y/2}$, y > 0. This represents
the density function of a chi-squared distribution with one degree of
freedom.
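
The derivation in 4.28 can be checked numerically. The following is a minimal sketch, assuming Z is standard normal so that F can be written with `math.erf`; the helper names (`normal_cdf`, `cdf_of_square`, `chi2_1_density`) are illustrative and not taken from the text. It differentiates $G(y) = 2F(y^{1/2}) - 1$ by a central difference and compares the result with the stated density.

```python
import math

def normal_cdf(z):
    # Standard normal CDF F(z), expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cdf_of_square(y):
    # G(y) = 2 F(sqrt(y)) - 1 for y > 0, as derived in 4.28.
    return 2.0 * normal_cdf(math.sqrt(y)) - 1.0

def chi2_1_density(y):
    # Claimed density g(y) = exp(-y/2) / sqrt(2 * pi * y), y > 0.
    return math.exp(-y / 2.0) / math.sqrt(2.0 * math.pi * y)

def central_difference(func, y, h=1e-5):
    # Symmetric difference quotient approximating func'(y).
    return (func(y + h) - func(y - h)) / (2.0 * h)

# Largest gap between G'(y) and g(y) over a few sample points (assumed grid).
max_gap = max(
    abs(central_difference(cdf_of_square, y) - chi2_1_density(y))
    for y in (0.5, 1.0, 2.0, 5.0)
)
print(max_gap)
```

The gap should be at the level of finite-difference error, which is consistent with g being the derivative of G.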


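4.29. (a)

$$\text{Failure rate} = \frac{F(x + h) - F(x)}{1 - F(x)}.$$

(b)

$$\text{Hazard rate} = \frac{dF(x)/dx}{1 - F(x)} = \frac{\frac{1}{\sigma}e^{-x/\sigma}}{e^{-x/\sigma}} = \frac{1}{\sigma}.$$

(c) If

$$\frac{dF(x)/dx}{1 - F(x)} = c,$$

then

$$-\log[1 - F(x)] = cx + c_1,$$

hence,

$$1 - F(x) = c_2e^{-cx}.$$

Since F(0) = 0, $c_2 = 1$. Therefore,

$$F(x) = 1 - e^{-cx}.$$

4.30.

$$\begin{aligned}
P(Y_n = r) &= \binom{n}{r}\left(\frac{\lambda t}{n}\right)^r\left(1 - \frac{\lambda t}{n}\right)^{n-r} \\
&= \frac{n(n-1)\cdots(n-r+1)}{r!}\left(\frac{\lambda t}{n}\right)^r\left(1 - \frac{\lambda t}{n}\right)^{n-r} \\
&= \frac{n(n-1)\cdots(n-r+1)}{n^r}\,\frac{(\lambda t)^r}{r!}\left(1 - \frac{\lambda t}{n}\right)^{-r}\left(1 - \frac{\lambda t}{n}\right)^{n}.
\end{aligned}$$

As $n \to \infty$, the first r factors on the right tend to 1, the next factor
is fixed, the next tends to 1, and the last factor tends to $e^{-\lambda t}$.
Hence,

$$\lim_{n\to\infty} P(Y_n = r) = \frac{e^{-\lambda t}(\lambda t)^r}{r!}.$$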
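
The limit in 4.30 can be illustrated numerically. This is a small sketch, assuming a hypothetical value $\lambda t = 2$ and r = 3 (neither is given in the text); `binomial_pmf` and `poisson_pmf` are illustrative helper names. It compares the binomial probability with success probability $\lambda t/n$ against the Poisson probability as n grows.

```python
import math

def binomial_pmf(n, r, p):
    # P(Y_n = r) for a binomial(n, p) count.
    return math.comb(n, r) * p**r * (1.0 - p) ** (n - r)

def poisson_pmf(r, mean):
    # Limiting probability e^{-mean} mean^r / r!.
    return math.exp(-mean) * mean**r / math.factorial(r)

rate_times_t = 2.0  # hypothetical value of lambda * t
r = 3               # hypothetical count

# Absolute gap between the binomial and Poisson probabilities for growing n.
errors = [
    abs(binomial_pmf(n, r, rate_times_t / n) - poisson_pmf(r, rate_times_t))
    for n in (10, 100, 1000, 10000)
]
print(errors)
```

The gaps shrink roughly in proportion to 1/n, in line with the factor-by-factor argument above.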
CHAPTER 5 


5.1. (a) The sequence $\{b_n\}_{n=1}^{\infty}$ is monotone increasing, since

$$\max\{a_1, a_2, \ldots, a_n\} \le \max\{a_1, a_2, \ldots, a_{n+1}\}.$$

It is also bounded. Therefore, it is convergent by Theorem 5.1.2. Its
limit is $\sup_{n\ge 1} a_n$, since $a_n \le b_n \le \sup_{n\ge 1} a_n$ for $n \ge 1$.
(b) Let $d_i = \log a_i - \log c$, i = 1, 2, .... Then

$$\log c_n = \frac{1}{n}\sum_{i=1}^{n}\log a_i = \log c + \frac{1}{n}\sum_{i=1}^{n} d_i.$$

To show that $(1/n)\sum_{i=1}^{n} d_i \to 0$ as $n \to \infty$: We have that $d_i \to 0$ as
$i \to \infty$. Therefore, for a given $\epsilon > 0$, there exists a positive integer
$N_1$ such that $|d_i| < \epsilon/2$ if $i > N_1$. Thus, for $n > N_1$,

$$\left|\frac{1}{n}\sum_{i=1}^{n} d_i\right| \le \frac{1}{n}\sum_{i=1}^{N_1}|d_i| + \frac{\epsilon}{2n}(n - N_1)
< \frac{1}{n}\sum_{i=1}^{N_1}|d_i| + \frac{\epsilon}{2}.$$

Furthermore, there exists a positive integer $N_2$ such that

$$\frac{1}{n}\sum_{i=1}^{N_1}|d_i| < \frac{\epsilon}{2},$$

if $n > N_2$. Hence, if $n > \max(N_1, N_2)$, $|(1/n)\sum_{i=1}^{n} d_i| < \epsilon$, which
implies that $(1/n)\sum_{i=1}^{n} d_i \to 0$, that is, $\log c_n \to \log c$, and $c_n \to c$ as
$n \to \infty$.

5.2. Let a and b be the limits of $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$, respectively. These
limits exist because the two sequences are of the Cauchy type. To show
that $d_n \to d$, where $d = |a - b|$:

$$|d_n - d| = \big|\,|a_n - b_n| - |a - b|\,\big| \le |(a_n - b_n) - (a - b)| \le |a_n - a| + |b_n - b|.$$

It is now obvious that $d_n \to d$ as $a_n \to a$ and $b_n \to b$.


5.3. Suppose that $a_n \to c$. Then, for a given $\epsilon > 0$, there exists a positive
integer N such that $|a_n - c| < \epsilon$ if $n > N$. Let $b_n = a_{k_n}$ be the nth term
of a subsequence. Since $k_n \ge n$, we have $|b_n - c| < \epsilon$ if $n > N$. Hence,
$b_n \to c$. Vice versa, if every subsequence converges to c, then $a_n \to c$,
since $\{a_n\}_{n=1}^{\infty}$ is a subsequence of itself.


5.4. If E is not bounded, then there exists a subsequence $\{b_n\}_{n=1}^{\infty}$ such that
$b_n \to \infty$, where $b_n = a_{k_n}$, $k_1 < k_2 < \cdots < k_n < \cdots$. This is a contradiction,
since $\{a_n\}_{n=1}^{\infty}$ is bounded.


5.5. (a) For a given $\epsilon > 0$, there exists a positive integer N such that
$|a_n - c| < \epsilon$ if $n > N$. Thus, there is a positive constant M such
that $|a_n - c| < M$ for all n. Now,

$$\begin{aligned}
\left|\frac{\sum_{i=1}^{n}\alpha_i a_i}{\sum_{i=1}^{n}\alpha_i} - c\right| &= \frac{1}{\sum_{i=1}^{n}\alpha_i}\left|\sum_{i=1}^{n}\alpha_i(a_i - c)\right| \\
&\le \frac{1}{\sum_{i=1}^{n}\alpha_i}\left|\sum_{i=1}^{N}\alpha_i(a_i - c)\right| + \frac{1}{\sum_{i=1}^{n}\alpha_i}\left|\sum_{i=N+1}^{n}\alpha_i(a_i - c)\right| \\
&\le \frac{1}{\sum_{i=1}^{n}\alpha_i}\sum_{i=1}^{N}\alpha_i|a_i - c| + \frac{1}{\sum_{i=1}^{n}\alpha_i}\sum_{i=N+1}^{n}\alpha_i|a_i - c| \\
&< \frac{M}{\sum_{i=1}^{n}\alpha_i}\sum_{i=1}^{N}\alpha_i + \frac{\epsilon}{\sum_{i=1}^{n}\alpha_i}\sum_{i=N+1}^{n}\alpha_i \\
&< M\frac{\sum_{i=1}^{N}\alpha_i}{\sum_{i=1}^{n}\alpha_i} + \epsilon.
\end{aligned}$$

Hence, as $n \to \infty$,

(b) Let $a_n = (-1)^n$. Then, $\{a_n\}_{n=1}^{\infty}$ does not converge. But $(1/n)\sum_{i=1}^{n} a_i$
is equal to zero if n is even, and is equal to -1/n if n is odd.
Hence, $(1/n)\sum_{i=1}^{n} a_i$ goes to zero as $n \to \infty$.


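5.6. For a given $\epsilon > 0$, there exists a positive integer N such that

$$\left|\frac{a_{n+1}}{a_n} - b\right| < \epsilon$$

if $n > N$. Since b < 1, we can choose $\epsilon$ so that $b + \epsilon < 1$. Then

$$\frac{a_{n+1}}{a_n} < b + \epsilon < 1, \quad n > N.$$

Hence, for $n \ge N + 1$,

$$a_{N+2} < a_{N+1}(b + \epsilon),$$
$$a_{N+3} < a_{N+2}(b + \epsilon) < a_{N+1}(b + \epsilon)^2,$$
$$\vdots$$
$$a_n < a_{N+1}(b + \epsilon)^{n-N-1} = \frac{a_{N+1}}{(b + \epsilon)^{N+1}}(b + \epsilon)^n.$$

Letting $c = a_{N+1}/(b + \epsilon)^{N+1}$, $r = b + \epsilon$, we get $a_n < cr^n$, where 0 < r < 1,
for $n \ge N + 1$.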


5.7. We first note that for each n, $a_n > 0$, which can be proved by induction.
Let us consider two cases.
1. b > 1. In this case, we can show that the sequence is (i) bounded
from above, and (ii) monotone increasing:
i. The sequence is bounded from above by $\sqrt{b}$, that is, $a_n < \sqrt{b}$:
$a_n^2 < b$ is true for n = 1 because $a_1 = 1 < b$. Suppose now that
$a_n^2 < b$; to show that $a_{n+1}^2 < b$:

Thus $a_{n+1}^2 < b$, and the sequence is bounded from above by $\sqrt{b}$.
ii. The sequence is monotone increasing: we have that
$(3b + a_n^2)/(3a_n^2 + b) > 1$, since $a_n^2 < b$, as was seen earlier. Hence,
$a_{n+1} > a_n$ for all n.
By Corollary 5.1.1(1), the sequence must be convergent. Let c be
its limit. We then have the equation

$$c = \frac{c(3b + c^2)}{3c^2 + b},$$

which results from taking the limit of both sides of
$a_{n+1} = a_n(3b + a_n^2)/(3a_n^2 + b)$ and noting that $\lim_{n\to\infty} a_{n+1} = \lim_{n\to\infty} a_n = c$. The
only solution to the above equation is $c = \sqrt{b}$.

2. b < 1. In this case, we can similarly show that the sequence is
bounded from below and is monotone decreasing. Therefore, by
Corollary 5.1.1(2) it must be convergent. Its limit is equal to $\sqrt{b}$.
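
The behavior described in 5.7 can be observed by running the recursion directly. The sketch below assumes the starting value $a_1 = 1$ used in the solution; the function name `iterate_sequence` and the sample values b = 5 and b = 0.25 are illustrative choices, not taken from the text.

```python
import math

def iterate_sequence(b, steps=8):
    # a_{n+1} = a_n (3b + a_n^2) / (3 a_n^2 + b), starting from a_1 = 1.
    a = 1.0
    terms = [a]
    for _ in range(steps):
        a = a * (3.0 * b + a * a) / (3.0 * a * a + b)
        terms.append(a)
    return terms

increasing_case = iterate_sequence(5.0)    # b > 1: terms rise toward sqrt(5)
decreasing_case = iterate_sequence(0.25)   # b < 1: terms fall toward 0.5

print(increasing_case[-1], math.sqrt(5.0))
print(decreasing_case[-1], math.sqrt(0.25))
```

In both cases the terms settle on $\sqrt{b}$ within a few steps, from below when b > 1 and from above when b < 1.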


5.8. The sequence is bounded from above by 3, that is, $a_n < 3$ for all n: This
is true for n = 1. If it is true for n, then

$$a_{n+1} = (2 + a_n)^{1/2} < (2 + 3)^{1/2} < 3.$$

By induction, $a_n < 3$ for all n. Furthermore, the sequence is monotone
increasing: $a_1 < a_2$, since $a_1 = 1$, $a_2 = \sqrt{3}$. If $a_n < a_{n+1}$, then

$$a_{n+2} = (2 + a_{n+1})^{1/2} > (2 + a_n)^{1/2} = a_{n+1}.$$

By induction, $a_n < a_{n+1}$ for all n. By Corollary 5.1.1(1) the sequence
must be convergent. Let c be its limit, which can be obtained by solving
the equation

$$c = (2 + c)^{1/2}.$$

The only solution is c = 2.
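
As a quick illustration of 5.8, the recursion $a_{n+1} = (2 + a_n)^{1/2}$ with $a_1 = 1$ can be iterated numerically. This is only a sketch; the function name `sqrt_recursion` and the number of steps are arbitrary choices.

```python
import math

def sqrt_recursion(steps=40):
    # a_1 = 1, a_{n+1} = sqrt(2 + a_n).
    a = 1.0
    terms = [a]
    for _ in range(steps):
        a = math.sqrt(2.0 + a)
        terms.append(a)
    return terms

terms = sqrt_recursion()
print(terms[:4], terms[-1])
```

The printed terms increase, stay below 3, and approach the limit c = 2 found above.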


Let m = 2n. Then $a_m - a_n = 1/(n + 1) + 1/(n + 2) + \cdots + 1/(2n)$.
Hence,

$$|a_m - a_n| > \frac{n}{2n} = \frac{1}{2}.$$

Therefore, $|a_m - a_n|$ cannot be made less than $\frac{1}{2}$ no matter how large
m and n are, if m = 2n. This violates the condition for a sequence to
be Cauchy.

5.11. For m > n, we have that

$$\begin{aligned}
|a_m - a_n| &= |(a_m - a_{m-1}) + (a_{m-1} - a_{m-2}) + \cdots + (a_{n+2} - a_{n+1}) + (a_{n+1} - a_n)| \\
&< br^{m-1} + br^{m-2} + \cdots + br^{n} \\
&= br^n(1 + r + r^2 + \cdots + r^{m-n-1}) \\
&= \frac{br^n(1 - r^{m-n})}{1 - r} < \frac{br^n}{1 - r}.
\end{aligned}$$

For a given $\epsilon > 0$, we choose n large enough such that $br^n/(1 - r) < \epsilon$,
which implies that $|a_m - a_n| < \epsilon$, that is, the sequence satisfies the
Cauchy criterion; hence, it is convergent.


5.12. If $\{s_n\}_{n=1}^{\infty}$ is bounded from above, then it must be convergent, since it is
monotone increasing, and thus $\sum_{n=1}^{\infty} a_n$ converges. Vice versa, if $\sum_{n=1}^{\infty} a_n$
is convergent, then $\{s_n\}_{n=1}^{\infty}$ is a convergent sequence; hence, it is
bounded, by Theorem 5.1.1.


5.13. Let s, be the nth partial sum of the series. Then 


S| 1 1 
si rc (cs ae 


1/1 1 1 1 1 1 
-5(5-3+3-3 "nel ura 
1/1 1 

-53-57 

1 

E as n> œ, 


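5.14. For n > 2, the binomial theorem gives

$$\begin{aligned}
\left(1 + \frac{1}{n}\right)^n &= 1 + \binom{n}{1}\frac{1}{n} + \binom{n}{2}\frac{1}{n^2} + \cdots + \frac{1}{n^n} \\
&= 2 + \frac{1}{2!}\left(1 - \frac{1}{n}\right) + \frac{1}{3!}\left(1 - \frac{1}{n}\right)\left(1 - \frac{2}{n}\right) + \cdots + \frac{1}{n^n} \\
&< 2 + \sum_{i=2}^{n}\frac{1}{i!} \\
&\le 2 + \sum_{i=2}^{n}\frac{1}{2^{i-1}} \\
&< 3.
\end{aligned}$$

Hence, $(1 + 1/n)^n < n$ for $n \ge 3$, that is, $n^{1/n} - 1 > 1/n$.

Therefore, the series $\sum_{n=1}^{\infty}(n^{1/n} - 1)^p$ is divergent by the comparison
test, since $\sum_{n=1}^{\infty} 1/n^p$ is divergent for $p \le 1$.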


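5.15. (a) Suppose that $a_n < M$ for all n, where M is a positive constant.
Then

$$\frac{a_n}{1 + a_n} > \frac{a_n}{1 + M}.$$

The series $\sum_{n=1}^{\infty} a_n/(1 + a_n)$ is divergent by the comparison test,
since $\sum_{n=1}^{\infty} a_n$ is divergent.
(b) We have that

$$\frac{a_n}{1 + a_n} = 1 - \frac{1}{1 + a_n}.$$

If $\{a_n\}_{n=1}^{\infty}$ is not bounded, then

$$\lim_{n\to\infty} a_n = \infty, \quad \text{hence,} \quad \lim_{n\to\infty}\frac{a_n}{1 + a_n} = 1 \ne 0.$$

Therefore, $\sum_{n=1}^{\infty} a_n/(1 + a_n)$ is divergent.


5.16. The sequence $\{s_n\}_{n=1}^{\infty}$ is monotone increasing. Hence, for n = 2, 3, ...,

$$\frac{a_n}{s_n^2} \le \frac{a_n}{s_ns_{n-1}} = \frac{s_n - s_{n-1}}{s_ns_{n-1}} = \frac{1}{s_{n-1}} - \frac{1}{s_n},$$

$$\begin{aligned}
\sum_{i=1}^{n}\frac{a_i}{s_i^2} &= \frac{a_1}{s_1^2} + \sum_{i=2}^{n}\frac{a_i}{s_i^2} \\
&\le \frac{a_1}{s_1^2} + \left(\frac{1}{s_1} - \frac{1}{s_2} + \cdots + \frac{1}{s_{n-1}} - \frac{1}{s_n}\right) \\
&= \frac{a_1}{s_1^2} + \frac{1}{s_1} - \frac{1}{s_n} \to \frac{a_1}{s_1^2} + \frac{1}{s_1} \quad \text{as } n \to \infty,
\end{aligned}$$

since $s_n \to \infty$ by the divergence of $\sum_{n=1}^{\infty} a_n$. It follows that $\sum_{n=1}^{\infty} a_n/s_n^2$ is
a convergent series.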
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5.17. Let A denote the sum of the series X% -4 a,. Then r, =A —s,_,, where 
S, = Li_, a;. The sequence {r,,}"_, is monotone decreasing. Hence, 


n wise =e 
a; Am Fans, + +a,-4 Fm Fa 
LS = 
i=m r; Fm l'n 
= Fa 
r 


Since r, > 0, we have 1 —1,,/r,, > 1 as n > œ. Therefore, for 0 < e< 1, 
there exists a positive integer k such that 


This implies that the series X5; 4„/r„, does not satisfy the Cauchy 
criterion. Hence, it is divergent. 


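5.18. $(1/n)\log(1/n) = -(\log n)/n \to 0$ as $n \to \infty$. Hence, $(1/n)^{1/n} \to 1$. Similarly,
$(1/n)\log(1/n^2) = -(2\log n)/n \to 0$, which implies $(1/n^2)^{1/n} \to 1$.


5.19. (a) $a_n^{1/n} = n^{1/n} - 1 \to 0 < 1$ as $n \to \infty$. Therefore, $\sum_{n=1}^{\infty} a_n$ is convergent
by the root test.

(b) $a_n < [\log(1 + n)]/\log e^{n^2} = [\log(1 + n)]/n^2$. Since

$$\int_{1}^{\infty}\frac{\log(1 + x)}{x^2}\,dx = \left[-\frac{1}{x}\log(1 + x)\right]_{1}^{\infty} + \int_{1}^{\infty}\frac{dx}{x(x + 1)}
= \log 2 + \left[\log\left(\frac{x}{x + 1}\right)\right]_{1}^{\infty} = 2\log 2,$$

the series $\sum_{n=1}^{\infty}[\log(1 + n)]/n^2$ is convergent by the integral test.
Hence, $\sum_{n=1}^{\infty} a_n$ converges by the comparison test.

(c) $a_n/a_{n+1} = (2n + 2)(2n + 3)/(2n + 1)^2 \Rightarrow \lim_{n\to\infty} n(a_n/a_{n+1} - 1) = \frac{3}{2} > 1$.
The series $\sum_{n=1}^{\infty} a_n$ is convergent by Raabe's test.

(d)

$$a_n = \sqrt{n + 2\sqrt{n}} - \sqrt{n} = \frac{n + 2\sqrt{n} - n}{\sqrt{n + 2\sqrt{n}} + \sqrt{n}} = \frac{2}{\sqrt{1 + 2/\sqrt{n}} + 1} \to 1 \quad \text{as } n \to \infty.$$

Therefore, $\sum_{n=1}^{\infty} a_n$ is divergent.

(e) $|a_n|^{1/n} = 4/n \to 0 < 1$. The series is absolutely convergent by the
root test.

(f) $a_n = (-1)^n\sin(\pi/n)$. For large n, $\sin(\pi/n) \sim \pi/n$. Therefore,
$\sum_{n=1}^{\infty} a_n$ is conditionally convergent.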
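
The Raabe limit used in 5.19(c) can be checked with exact rational arithmetic. The sketch below takes the ratio $a_n/a_{n+1} = (2n + 2)(2n + 3)/(2n + 1)^2$ as given in the solution; the function name `raabe_quantity` and the sample values of n are illustrative.

```python
from fractions import Fraction

def raabe_quantity(n):
    # n * (a_n / a_{n+1} - 1) with a_n / a_{n+1} = (2n+2)(2n+3) / (2n+1)^2.
    ratio = Fraction((2 * n + 2) * (2 * n + 3), (2 * n + 1) ** 2)
    return n * (ratio - 1)

# Evaluate exactly, then convert to float only for display.
values = {n: float(raabe_quantity(n)) for n in (10, 1000, 1000000)}
print(values)
```

The values approach 3/2 from below as n grows, so the limit exceeds 1 as Raabe's test requires.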


5.20. (a) Applying the ratio test, we have the condition

$$|x|^2\lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right| < 1,$$

which is equivalent to

$$\frac{|x|^2}{3} < 1, \quad \text{that is,} \quad |x| < \sqrt{3}.$$

The series is divergent if $|x| = \sqrt{3}$. Hence, it is uniformly convergent
on [-r, r] where $r < \sqrt{3}$.


(b) Using the ratio test, we get

$$|x|\lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right| < 1,$$

where $a_n = 10^n/n$. This is equivalent to

$$10|x| < 1.$$

The series is divergent if $|x| = \frac{1}{10}$. Hence, it is uniformly convergent
on [-r, r] where $r < \frac{1}{10}$.

(c) The series is uniformly convergent on [-r, r] where r < 1.

(d) The series is uniformly convergent everywhere by Weierstrass's
M-test, since

$$\left|\frac{\cos nx}{n(n^2 + 1)}\right| \le \frac{1}{n(n^2 + 1)} \quad \text{for all } x,$$

and the series whose nth term is $1/[n(n^2 + 1)]$ is convergent.


5.21. The series ©”, _,a, is convergent by Theorem 5.2.14. Let s denote its 
sum: 
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Now, 


"lias And On 
Let s}, denote the sum of the first 3n terms of X7; b,. Then, 
S3n = Son H Uys 
where s,,, is the sum of the first 2n terms of L*_, a„, and 


1 1 1 
= + + +, 
2n+1 2n+3 4n-1 


u n=1,2,... 


The sequence {u,}"°_, is monotone increasing and is bounded from 
above by 4, since 


n 
i <= ——— n=1,2,... 
2n+1 


(the number of terms that make up u, is equal to n). Thus {s5,} -1 is 
convergent, which implies convergence of X71 b,. Let £ denote the 
sum of this series. Note that 


t>(1+4-i) +(+), 


since 1/(4n — 3) + 1/(4n — 1) — 1/2n > 0 for all n. 


5.22. Let c„ denote the nth term of Cauchy’s product. Then 


n 2 1 
“CD È [(n-k + 1)(k+ 1]? 
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Since n-k+1<n+1landk+1<n+1, 


4 1 


[nla E 


day ntl 


Hence, c, does not go to zero as n > œ. Therefore, L*_, C, is diver- 
gent. 


5.23. f(x) > f(x), where 


The convergence is not uniform on [0,~), since f,(x) is continuous on 
[0,) for all n, but f(x) is discontinuous at x = 0. 


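5.24. (a) $1/n^x \le 1/n^{1+\delta}$ for $x \ge 1 + \delta$. Since $\sum_{n=1}^{\infty} 1/n^{1+\delta}$ is a convergent series,
$\sum_{n=1}^{\infty} 1/n^x$ is uniformly convergent on $[1 + \delta, \infty)$ by Weierstrass's
M-test.

(b) $d/dx(1/n^x) = -(\log n)/n^x$. The series $\sum_{n=1}^{\infty}(\log n)/n^x$ is uniformly
convergent on $[1 + \delta, \infty)$:

$$\frac{\log n}{n^x} \le \frac{\log n}{n^{1+\delta}} = \frac{1}{n^{1+\delta/2}}\,\frac{\log n}{n^{\delta/2}}, \quad x \ge \delta + 1.$$

Since $(\log n)/n^{\delta/2} \to 0$ as $n \to \infty$, there exists a positive integer N
such that

$$\frac{\log n}{n^{\delta/2}} < 1 \quad \text{if } n > N.$$

Therefore,

$$\frac{\log n}{n^x} < \frac{1}{n^{1+\delta/2}}$$

if $n > N$ and $x \ge \delta + 1$. But $\sum_{n=1}^{\infty} 1/n^{1+\delta/2}$ is convergent. Hence,
$\sum_{n=1}^{\infty}(\log n)/n^x$ is uniformly convergent on $[1 + \delta, \infty)$ by Weierstrass's
M-test. If $\zeta(x)$ denotes the sum of the series $\sum_{n=1}^{\infty} 1/n^x$,
then

$$\zeta'(x) = -\sum_{n=1}^{\infty}\frac{\log n}{n^x}, \quad x \ge \delta + 1.$$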
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5.25. We have that 


S 7 1 
er ta kak 


a “lia” 


if —1<x<1 (see Example 5.4.4). Differentiating this series term by 
term, we get 


E k 
B a Te 
It follows that 
z nt+r-1)\_, n fUFP) 
Lal ; |p (LP) =P ri 
_r(—p) 
P 
Taking now second derivatives, we obtain 
2 7 k-1 k(k +1) 
n(n—1)x" alee )- a. 
Ena- TaS 


From this we conclude that 


T = kx(1 + kx 
male t jan = ( R 
n=1 n (1 =x) 
Hence, 
E Poel DN » P'(1-p)r[1+r(1-p)] 
l a |p (=p) = = 
n=0 P 
ttl =p) ram) 
p’ l 
5.26. x 
n +r-1 ran 
d(th= Ye 4 A a 
n=0 
rye [ntr-1 n” 
i(i 
n=0 n 
aE z qe' <1. 
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The series converges uniformly on (—~,s5] where s< —logg. Yes, 
formula (5.63) can be applied, since there exists a neighborhood N;,(0) 
contained inside the interval of convergence by the fact that —log q > 0. 


5.28. It is sufficient to show that the series L*_,(u),/n!)r" is absolutely 
convergent for some 7> 0 (see Theorem 5.6.1). Using the root test, 


ana 
= ji perce eee, 
á Pa ti (ny) 
| 1 jir 
= Te lim sup in 
PE m) "ntt t2n 
CAK 
= çe lim sup 
now 
Let 
1 
CAM 


m = lim sup 


noo 


If m>O and finite, then the series L*_,(ui,/n!)r" is absolutely 
convergent if p <1, that is, if 7<1/(em). 


5.30. (a) 
E(X,) = ee p)” 


=0 


n-i 


oe p)” 
n! 
a p A 


n-1 
=np(p+1-—p) 
=np. 


We can similarly show that

$$E(X_n^2) = np(1 - p) + n^2p^2.$$

Hence,

$$\mathrm{Var}(X_n) = E(X_n^2) - n^2p^2 = np(1 - p).$$


(b) We have that $P(|Y_n| \ge r\sigma) \le 1/r^2$. Let $r\sigma = \epsilon$, where $\sigma^2 = p(1 - p)/n$. Then

$$P(|Y_n| \ge \epsilon) \le \frac{p(1 - p)}{n\epsilon^2}.$$


(c) As $n \to \infty$, $P(|Y_n| \ge \epsilon) \to 0$, which implies that $Y_n$ converges in
probability to zero.


5.31. (a)

$$\phi_n(t) = (p_ne^t + q_n)^n, \quad q_n = 1 - p_n,$$
$$\phantom{\phi_n(t)} = [1 + p_n(e^t - 1)]^n.$$

Let $np_n = r_n$. Then

$$\phi_n(t) = \left[1 + \frac{r_n}{n}(e^t - 1)\right]^n \to \exp[\mu(e^t - 1)]$$

as $n \to \infty$. The limit of $\phi_n(t)$ is the moment generating function of a
Poisson distribution with mean $\mu$. Thus $S_n$ has a limiting Poisson
distribution with mean $\mu$.
5.32. We have that 
Q= (I, —H, +kH, —k?H,+ eH, - +) 
= (I, —H,) +k°H, — 2k°H;, ++. 
Thus 
y'Qy=y' (1, — Hy + k’y'Hgy — 2k*y'Hyy + + 
=SS,+k?S,—2k38,+ °° 


= SS; + x (i-2)(—-k)''S,. 
i=3 
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CHAPTER 6 


6.1. If f(x) is Riemann integrable, then inequality (6.1) is satisfied; hence,
equality (6.6) is true, as was shown to be the case in the proof of
Theorem 6.2.1. Vice versa, if equality (6.6) is satisfied, then by the
double inequality (6.2), S(P, f) must have a limit as $\Delta_P \to 0$, which
implies that f(x) is Riemann integrable on [a, b].


6.2. Consider the function f(x) on [0,1] such that f(x) = 0 if $0 \le x < \frac{1}{2}$,
f(x) = 2 if x = 1, and $f(x) = \sum_{i=0}^{n} 1/2^i$ if $(n + 1)/(n + 2) \le x < (n + 2)/(n + 3)$
for n = 0, 1, 2, .... This function has a countable number
of discontinuities at $\frac{1}{2}, \frac{2}{3}, \frac{3}{4}, \ldots$, but is Riemann integrable on [0, 1],
since it is monotone increasing (see Theorem 6.3.2).


6.3. Suppose that f(x) has one discontinuity of the first kind at x = c,
a < c < b. Let $\lim_{x\to c^-} f(x) = L_1$, $\lim_{x\to c^+} f(x) = L_2$, $L_1 \ne L_2$. In any
partition P of [a, b], the point c appears in at most two subintervals.
The contribution to $US_P(f) - LS_P(f)$ from these intervals is less than
$2(M - m)\Delta_P$, where M, m are the supremum and infimum of f(x) on
[a, b], and this can be made as small as we please. Furthermore,
$\sum_i M_i\,\Delta x_i - \sum_i m_i\,\Delta x_i$ for the remaining subintervals can also be made
as small as we please. Hence, $US_P(f) - LS_P(f)$ can be made smaller
than $\epsilon$ for any given $\epsilon > 0$. Therefore, f(x) is Riemann integrable on
[a, b]. A similar argument can be used if f(x) has a finite number of
discontinuities of the first kind in [a, b].


6.4. Consider the following partition of [0,1]: $P = \{0, 1/(2n), 1/(2n - 1), \ldots, 1/3, 1/2, 1\}$.
The number of partition points is 2n + 1 including 0
and 1. In this case,


T 1 
cos| Z @n = 1] — =— cos mn 
2n-1 2 2n 


+e 


1 1 T 
T mag olan = 1)] = zaoz aa n| 


1 1 3m T 1 
+ gOS ae > + cos( = = JORR 
1 1 1 
= + + Seared 
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As n>, X?! |Af, |> %, since L*_,1/n is divergent. Hence, f(x) is 
not of bounded variation on [0, 1]. 


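6.5. (a) For a given $\epsilon > 0$, there exists a constant M > 0 such that

$$\left|\frac{f'(x)}{g'(x)} - L\right| < \epsilon$$

if x > M. Since g'(x) > 0,

$$|f'(x) - Lg'(x)| < \epsilon g'(x).$$

Hence, if $\lambda_1$ and $\lambda_2$ are chosen larger than M, then

$$\left|\int_{\lambda_1}^{\lambda_2}[f'(x) - Lg'(x)]\,dx\right| \le \int_{\lambda_1}^{\lambda_2}|f'(x) - Lg'(x)|\,dx < \epsilon\int_{\lambda_1}^{\lambda_2} g'(x)\,dx.$$


(b) From (a) we get

$$|f(\lambda_2) - Lg(\lambda_2) - f(\lambda_1) + Lg(\lambda_1)| < \epsilon[g(\lambda_2) - g(\lambda_1)].$$

Dividing both sides by $g(\lambda_2)$ [which is positive for large $\lambda_2$, since
$g(x) \to \infty$ as $x \to \infty$], we obtain

$$\left|\frac{f(\lambda_2)}{g(\lambda_2)} - L - \frac{f(\lambda_1)}{g(\lambda_2)} + L\frac{g(\lambda_1)}{g(\lambda_2)}\right| < \epsilon\left[1 - \frac{g(\lambda_1)}{g(\lambda_2)}\right] < \epsilon.$$

Hence,

$$\left|\frac{f(\lambda_2)}{g(\lambda_2)} - L\right| < \epsilon + \frac{|f(\lambda_1)|}{g(\lambda_2)} + |L|\frac{g(\lambda_1)}{g(\lambda_2)}.$$


(c) For sufficiently large $\lambda_2$, the second and third terms on the
right-hand side of the above inequality can each be made smaller
than $\epsilon$; hence,

$$\left|\frac{f(\lambda_2)}{g(\lambda_2)} - L\right| < 3\epsilon.$$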


6.6. We have that

$$mg(x) \le f(x)g(x) \le Mg(x),$$

where m and M are the infimum and supremum of f(x) on [a, b],
respectively. Let $\xi$ and $\eta$ be points in [a, b] such that $m = f(\xi)$,
$M = f(\eta)$. We conclude that

$$f(\xi) \le \frac{\int_a^b f(x)g(x)\,dx}{\int_a^b g(x)\,dx} \le f(\eta).$$

By the intermediate-value theorem (Theorem 3.4.4), there exists a
constant c, between $\xi$ and $\eta$, such that

$$\frac{\int_a^b f(x)g(x)\,dx}{\int_a^b g(x)\,dx} = f(c).$$

Note that c can be equal to $\xi$ or $\eta$ in case equality is attained at the
lower end or the upper end of the above double inequality.


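Integration by parts gives

$$\int_a^b f(x)\,dg(x) = f(b)g(b) - f(a)g(a) + \int_a^b g(x)\,d[-f(x)].$$

Since g(x) is bounded, $f(b)g(b) \to 0$ as $b \to \infty$. Let us now establish
convergence of $\int_a^b g(x)\,d[-f(x)]$ as $b \to \infty$: let M > 0 be such that
$|g(x)| \le M$ for all $x \ge a$. In addition,

$$\int_a^b M\,d[-f(x)] = Mf(a) - Mf(b) \to Mf(a) \quad \text{as } b \to \infty.$$

Hence, $\int_a^{\infty} M\,d[-f(x)]$ is a convergent integral. Since -f(x) is monotone
increasing on $[a, \infty)$, then

$$\int_a^b |g(x)|\,d[-f(x)] \le \int_a^b M\,d[-f(x)].$$

This implies absolute convergence of $\int_a^{\infty} g(x)\,d[-f(x)]$. It follows that
the integral $\int_a^{\infty} f(x)\,dg(x)$ is convergent.

6.10. Let n be a positive integer. Then

$$\begin{aligned}
\int_0^{n\pi}\left|\frac{\sin x}{x}\right|dx &= \int_0^{\pi}\frac{\sin x}{x}\,dx - \int_{\pi}^{2\pi}\frac{\sin x}{x}\,dx + \cdots + (-1)^{n-1}\int_{(n-1)\pi}^{n\pi}\frac{\sin x}{x}\,dx \\
&= \int_0^{\pi}\sin x\left[\frac{1}{x} + \frac{1}{x + \pi} + \cdots + \frac{1}{x + (n-1)\pi}\right]dx \\
&> \left[\frac{1}{\pi} + \frac{1}{2\pi} + \cdots + \frac{1}{n\pi}\right]\int_0^{\pi}\sin x\,dx \\
&= \frac{2}{\pi}\left(1 + \frac{1}{2} + \cdots + \frac{1}{n}\right) \to \infty \quad \text{as } n \to \infty.
\end{aligned}$$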
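
The lower bound in 6.10 can be examined numerically. The following sketch approximates $\int_0^{n\pi}|\sin x|/x\,dx$ with a midpoint rule (a choice made here for simplicity, not a method from the text) and compares it with $(2/\pi)(1 + 1/2 + \cdots + 1/n)$; the function names and the grid size are illustrative.

```python
import math

def abs_sinc_integral(n, points_per_period=2000):
    # Midpoint-rule approximation of the integral of |sin x| / x over [0, n*pi].
    total_points = n * points_per_period
    h = n * math.pi / total_points
    total = 0.0
    for k in range(total_points):
        x = (k + 0.5) * h  # midpoints never hit x = 0
        total += abs(math.sin(x)) / x
    return total * h

def harmonic_lower_bound(n):
    # (2 / pi) * (1 + 1/2 + ... + 1/n), the bound derived in 6.10.
    return (2.0 / math.pi) * sum(1.0 / k for k in range(1, n + 1))

checks = [(n, abs_sinc_integral(n), harmonic_lower_bound(n)) for n in (1, 5, 20)]
for n, integral, bound in checks:
    print(n, integral, bound)
```

In each case the integral exceeds the harmonic bound, and both grow without limit as n increases.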


6.11. (a) This is convergent by the integral test, since $\int_1^{\infty}(\log x)/(x\sqrt{x})\,dx = 4\int_0^{\infty} e^{-x}x\,dx = 4$
by a proper change of variable.

(b) $(n + 4)/(2n^3 + 1) \sim 1/(2n^2)$. But $\sum_{n=1}^{\infty} 1/n^2$ is convergent by the
integral test, since $\int_1^{\infty} dx/x^2 = 1$. Hence, this series is convergent by
the comparison test.

(c) $1/(\sqrt{n + 1} - 1) \sim 1/\sqrt{n}$. By the integral test, $\sum_{n=1}^{\infty} 1/\sqrt{n}$ is divergent,
since $\int_1^{\infty} dx/\sqrt{x} = 2\sqrt{x}\,\big|_1^{\infty} = \infty$. Hence, $\sum_{n=1}^{\infty} 1/(\sqrt{n + 1} - 1)$ is
divergent.


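6.13. (a) Near x = 0 we have $x^{m-1}(1 - x)^{n-1} \sim x^{m-1}$, and $\int_0^1 x^{m-1}\,dx = 1/m$.
Also, near x = 1 we have $x^{m-1}(1 - x)^{n-1} \sim (1 - x)^{n-1}$, and
$\int_0^1 (1 - x)^{n-1}\,dx = 1/n$. Hence, B(m, n) converges if m > 0, n > 0.
(b) Let $\sqrt{x} = \sin\theta$. Then

$$\int_0^1 x^{m-1}(1 - x)^{n-1}\,dx = 2\int_0^{\pi/2}\sin^{2m-1}\theta\,\cos^{2n-1}\theta\,d\theta.$$

(c) Let x = 1/(1 + y). Then

$$\int_0^1 x^{m-1}(1 - x)^{n-1}\,dx = \int_0^{\infty}\frac{y^{n-1}}{(1 + y)^{m+n}}\,dy.$$

Letting z = 1 - x, we get

$$\int_0^1 x^{m-1}(1 - x)^{n-1}\,dx = -\int_1^0 (1 - z)^{m-1}z^{n-1}\,dz = \int_0^1 x^{n-1}(1 - x)^{m-1}\,dx.$$

Hence, B(m, n) = B(n, m). It follows that

$$B(m, n) = \int_0^{\infty}\frac{x^{m-1}}{(1 + x)^{m+n}}\,dx.$$

(d)

$$B(m, n) = \int_0^{\infty}\frac{x^{n-1}}{(1 + x)^{m+n}}\,dx = \int_0^1\frac{x^{n-1}}{(1 + x)^{m+n}}\,dx + \int_1^{\infty}\frac{x^{n-1}}{(1 + x)^{m+n}}\,dx.$$

Let y = 1/x in the second integral. We get

$$\int_1^{\infty}\frac{x^{n-1}}{(1 + x)^{m+n}}\,dx = \int_0^1\frac{y^{m-1}}{(1 + y)^{m+n}}\,dy.$$

Therefore,

$$B(m, n) = \int_0^1\frac{x^{n-1}}{(1 + x)^{m+n}}\,dx + \int_0^1\frac{x^{m-1}}{(1 + x)^{m+n}}\,dx = \int_0^1\frac{x^{m-1} + x^{n-1}}{(1 + x)^{m+n}}\,dx.$$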
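
The final formula in 6.13 can be spot-checked numerically. The sketch below assumes the standard identity $B(m, n) = \Gamma(m)\Gamma(n)/\Gamma(m + n)$, which is not stated in this solution, and uses a simple midpoint rule; the function names and the sample values m = 2, n = 3 are illustrative.

```python
import math

def beta_via_gamma(m, n):
    # Assumed reference value: B(m, n) = Gamma(m) Gamma(n) / Gamma(m + n).
    return math.gamma(m) * math.gamma(n) / math.gamma(m + n)

def beta_via_unit_interval(m, n, points=20000):
    # Midpoint rule for the integral over [0, 1] of
    # (x^(m-1) + x^(n-1)) / (1 + x)^(m+n), the last formula in 6.13.
    h = 1.0 / points
    total = 0.0
    for k in range(points):
        x = (k + 0.5) * h
        total += (x ** (m - 1) + x ** (n - 1)) / (1.0 + x) ** (m + n)
    return total * h

m, n = 2.0, 3.0  # hypothetical sample values with m, n >= 1
approx = beta_via_unit_interval(m, n)
exact = beta_via_gamma(m, n)
print(approx, exact)
```

For these values both computations give approximately 1/12, and swapping m and n leaves the integrand unchanged, which reflects B(m, n) = B(n, m).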


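6.14. (a)

$$\int_0^{\infty}\frac{dx}{\sqrt{1 + x^3}} = \int_0^1\frac{dx}{\sqrt{1 + x^3}} + \int_1^{\infty}\frac{dx}{\sqrt{1 + x^3}}.$$

The first integral exists because the integrand is continuous. The
second integral is convergent because

$$\int_1^{\infty}\frac{dx}{\sqrt{1 + x^3}} < \int_1^{\infty}\frac{dx}{x^{3/2}} = 2.$$

Hence, $\int_0^{\infty} dx/\sqrt{1 + x^3}$ is convergent.

(b) Divergent, since for large x we have $1/(1 + x^3)^{1/3} \sim 1/(1 + x)$, and
$\int_0^{\infty} dx/(1 + x) = \infty$.

(c) Convergent, since if $f(x) = 1/(1 - x^3)^{1/3}$ and $g(x) = 1/(1 - x)^{1/3}$,
then

$$\lim_{x\to 1^-}\frac{f(x)}{g(x)} = \left(\frac{1}{3}\right)^{1/3}.$$

But $\int_0^1 g(x)\,dx = \frac{3}{2}$. Hence, $\int_0^1 f(x)\,dx$ is convergent.

(d) Partition the integral as the sum of

$$\int_0^1\frac{dx}{\sqrt{x}\,(1 + 2x)} \quad \text{and} \quad \int_1^{\infty}\frac{dx}{\sqrt{x}\,(1 + 2x)}.$$

The first integral is convergent, since near x = 0 we have
$1/[\sqrt{x}\,(1 + 2x)] \sim 1/\sqrt{x}$, and $\int_0^1 dx/\sqrt{x} = 2$. The second integral is
also convergent, since as $x \to \infty$ we have $1/[\sqrt{x}\,(1 + 2x)] \sim 1/(2x^{3/2})$,
and $\int_1^{\infty} dx/x^{3/2} = 2$.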
6.15. The proof of this result is similar to the proof of Corollary 6.4.2. 
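6.16. It is easy to show that

$$h_n'(x) = -\frac{(b - x)^{n-1}}{(n - 1)!}f^{(n)}(x).$$

Hence,

$$h_n(a) = h_n(b) - \int_a^b h_n'(x)\,dx = \frac{1}{(n - 1)!}\int_a^b (b - x)^{n-1}f^{(n)}(x)\,dx,$$

since $h_n(b) = 0$. It follows that

$$f(b) = f(a) + (b - a)f'(a) + \cdots + \frac{(b - a)^{n-1}}{(n - 1)!}f^{(n-1)}(a) + \frac{1}{(n - 1)!}\int_a^b (b - x)^{n-1}f^{(n)}(x)\,dx.$$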


6.17. Let $G(x) = \int_a^x g(t)\,dt$. Then, G(x) is uniformly continuous and $G'(x) = g(x)$
by Theorem 6.4.8. Thus

$$\int_a^b f(x)g(x)\,dx = f(x)G(x)\Big|_a^b - \int_a^b G(x)f'(x)\,dx = f(b)G(b) - \int_a^b G(x)f'(x)\,dx.$$

Now, since G(x) is continuous, then by Corollary 3.4.1,

$$G(\xi) \le G(x) \le G(\eta)$$

for all x in [a, b], where $\xi$, $\eta$ are points in [a, b] at which G(x) achieves
its infimum and supremum in [a, b], respectively. Furthermore, since
f(x) is monotone (say monotone increasing) and f'(x) exists, then
$f'(x) \ge 0$ by Theorem 4.2.3. It follows that

$$G(\xi)\int_a^b f'(x)\,dx \le \int_a^b G(x)f'(x)\,dx \le G(\eta)\int_a^b f'(x)\,dx.$$

This implies that

$$\int_a^b G(x)f'(x)\,dx = \lambda\int_a^b f'(x)\,dx,$$

where $G(\xi) \le \lambda \le G(\eta)$. By Theorem 3.4.4, there exists a point c
between $\xi$ and $\eta$ such that $\lambda = G(c)$. Hence,

$$\int_a^b G(x)f'(x)\,dx = G(c)\int_a^b f'(x)\,dx = [f(b) - f(a)]\int_a^c g(x)\,dx.$$

Consequently,

$$\begin{aligned}
\int_a^b f(x)g(x)\,dx &= f(b)G(b) - \int_a^b G(x)f'(x)\,dx \\
&= f(b)\int_a^b g(x)\,dx - [f(b) - f(a)]\int_a^c g(x)\,dx \\
&= f(a)\int_a^c g(x)\,dx + f(b)\int_c^b g(x)\,dx.
\end{aligned}$$


6.18. This follows directly from applying the result in Exercise 6.17 and
letting $f(x) = 1/x$, $g(x) = \sin x$:

\[ \int_a^b \frac{\sin x}{x}\,dx = \frac{1}{a}\int_a^c \sin x\,dx + \frac{1}{b}\int_c^b \sin x\,dx
 = \frac{1}{a}(\cos a - \cos c) + \frac{1}{b}(\cos c - \cos b). \]

Therefore, since $1/b \le 1/a$,

\[ \left|\int_a^b \frac{\sin x}{x}\,dx\right| \le \frac{1}{a}|\cos a - \cos c| + \frac{1}{a}|\cos c - \cos b| \le \frac{4}{a}. \]

6.21.


Let $\epsilon > 0$ be given. Choose $\eta > 0$ such that

\[ \eta[g(b) - g(a)] < \epsilon. \]

Since $f(x)$ is uniformly continuous on $[a,b]$ (see Theorem 3.4.6), there
exists a $\delta > 0$ such that

\[ |f(x) - f(z)| < \eta \]

if $|x - z| < \delta$ for all $x$, $z$ in $[a,b]$. Let us now choose a partition $P$ of
$[a,b]$ whose norm $\Delta_p$ is smaller than $\delta$. Then,

\[ US_P(f,g) - LS_P(f,g) = \sum_{i=1}^n (M_i - m_i)\,\Delta g_i \le \eta \sum_{i=1}^n \Delta g_i = \eta[g(b) - g(a)] < \epsilon. \]

It follows that $f(x)$ is Riemann-Stieltjes integrable on $[a,b]$.

6.22. Since $g(x)$ is continuous, for any positive integer $n$ we can choose a
partition $P$ such that

\[ \Delta g_i = \frac{g(b) - g(a)}{n}, \qquad i = 1,2,\ldots,n. \]

Also, since $f(x)$ is monotone increasing, then $M_i = f(x_i)$, $m_i = f(x_{i-1})$,
$i = 1,2,\ldots,n$. Hence,

\[ US_P(f,g) - LS_P(f,g) = \frac{g(b) - g(a)}{n}\sum_{i=1}^n [f(x_i) - f(x_{i-1})] = \frac{g(b) - g(a)}{n}\,[f(b) - f(a)]. \]

This can be made less than $\epsilon$, for any given $\epsilon > 0$, if $n$ is chosen large
enough. It follows that $f(x)$ is Riemann-Stieltjes integrable with respect
to $g(x)$.

6.23. Let $f(x) = x^k/(1+x^2)$, $g(x) = x^{k-2}$. Then $\lim_{x\to\infty} f(x)/g(x) = 1$. The
integral $\int_0^\infty x^{k-2}\,dx$ is divergent, since $\int_0^\infty x^{k-2}\,dx = x^{k-1}/(k-1)\big|_0^\infty = \infty$
if $k > 1$. If $k = 1$, then $\int_0^\infty x^{k-2}\,dx = \int_0^\infty dx/x = \infty$. By Theorem 6.5.3,
$\int_0^\infty f(x)\,dx$ must be divergent if $k \ge 1$. A similar procedure can be used
to show that $\int_{-\infty}^0 f(x)\,dx$ is divergent.

6.24. Since $E(X) = 0$, then $\mathrm{Var}(X) = E(X^2)$, which is given by


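\[ E(X^2) = \int_{-\infty}^{\infty} \frac{x^2 e^x}{(1+e^x)^2}\,dx. \]

Let $u = e^x/(1+e^x)$. The integral becomes

\[ E(X^2) = \int_0^1 \left[\log\frac{u}{1-u}\right]^2 du
 = \left.\log\!\left(\frac{u}{1-u}\right)[u\log u + (1-u)\log(1-u)]\right|_0^1 - \int_0^1 \frac{u\log u + (1-u)\log(1-u)}{u(1-u)}\,du \]
\[ \phantom{E(X^2)} = -\int_0^1 \frac{\log u}{1-u}\,du - \int_0^1 \frac{\log(1-u)}{u}\,du = -2\int_0^1 \frac{\log u}{1-u}\,du. \]

But

\[ \int_0^1 \frac{\log u}{1-u}\,du = \int_0^1 (1 + u + u^2 + \cdots + u^n + \cdots)\log u\,du, \]

and

\[ \int_0^1 u^n \log u\,du = \left.\frac{1}{n+1}\,u^{n+1}\log u\right|_0^1 - \frac{1}{n+1}\int_0^1 u^n\,du = -\frac{1}{(n+1)^2}. \]

Hence,

\[ \int_0^1 \frac{\log u}{1-u}\,du = -\sum_{n=0}^{\infty} \frac{1}{(n+1)^2} = -\frac{\pi^2}{6}, \]

and $E(X^2) = \pi^2/3$.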
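As a quick sanity check of the value $\pi^2/3$ in 6.24, the following sketch compares a partial sum of the series with a direct midpoint-rule evaluation of the integral. The function names, the truncation limit, and the number of panels are illustrative assumptions only.

```python
import math


def series_second_moment(terms=100000):
    """2 * sum of 1/(n+1)^2 over n = 0, ..., terms-1; tends to pi^2/3."""
    return 2.0 * sum(1.0 / (n + 1) ** 2 for n in range(terms))


def numeric_second_moment(limit=40.0, panels=20000):
    """Midpoint rule for the integral of x^2 e^x / (1 + e^x)^2 over [-limit, limit]."""
    h = 2.0 * limit / panels
    total = 0.0
    for i in range(panels):
        x = -limit + (i + 0.5) * h
        # The density is symmetric in x; exp(-|x|) avoids overflow for large |x|.
        ex = math.exp(-abs(x))
        total += x * x * ex / (1.0 + ex) ** 2
    return total * h


series_value = series_second_moment()
numeric_value = numeric_second_moment()
print(series_value, numeric_value, math.pi ** 2 / 3)
```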
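6.26. Let $g(x) = f(x)/x^{\alpha}$, $h(x) = 1/x^{\alpha}$. We have that

\[ \lim_{x\to 0^+} \frac{g(x)}{h(x)} = f(0) < \infty. \]

Since $\int_0^\delta h(x)\,dx = \delta^{1-\alpha}/(1-\alpha) < \infty$ for any $\delta > 0$, by Theorem 6.5.7,
$\int_0^\delta f(x)/x^{\alpha}\,dx$ exists. Furthermore,

\[ \int_\delta^\infty \frac{f(x)}{x^{\alpha}}\,dx \le \frac{1}{\delta^{\alpha}}\int_\delta^\infty f(x)\,dx \le \frac{1}{\delta^{\alpha}}. \]

Hence,

\[ E(X^{-\alpha}) = \int_0^\infty \frac{f(x)}{x^{\alpha}}\,dx \]

exists.

6.27. Let $g(x) = f(x)/x^{1+\alpha}$, $h(x) = 1/x$. Then

\[ \lim_{x\to 0^+} \frac{g(x)}{h(x)} = k < \infty. \]

Since $\int_0^\delta h(x)\,dx$ is divergent for any $\delta > 0$, then so is $\int_0^\delta g(x)\,dx$ by
Theorem 6.5.7, and hence

\[ E[X^{-(1+\alpha)}] = \int_0^\delta g(x)\,dx + \int_\delta^\infty g(x)\,dx = \infty. \]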


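6.28. Applying formula (6.84), the density function of $W$ is given by

\[ g(w) = \frac{2\,\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)}\left(1 + \frac{w^2}{n}\right)^{-(n+1)/2}, \qquad w \ge 0. \]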
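6.29. (a) From Theorem 6.9.2, we have

\[ P(|X - \mu| \ge u\sigma) \le \frac{1}{u^2}. \]

Letting $u\sigma = r$, we get

\[ P(|X - \mu| \ge r) \le \frac{\sigma^2}{r^2}. \]

(b) This follows from (a) and the fact that

\[ E(\bar{X}_n) = \mu, \qquad \mathrm{Var}(\bar{X}_n) = \frac{\sigma^2}{n}. \]

(c)

\[ P(|\bar{X}_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n\epsilon^2}. \]

Hence, as $n \to \infty$, $P(|\bar{X}_n - \mu| \ge \epsilon) \to 0$.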
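The bound in 6.29(c) can be illustrated by simulation. This is a minimal sketch under assumptions that are not in the text: the observations are taken to be uniform on $(0,1)$, so that $\mu = 1/2$ and $\sigma^2 = 1/12$, and the function name `chebyshev_check` is hypothetical.

```python
import random


def chebyshev_check(n, eps, trials=2000, seed=0):
    """Return (empirical P(|mean - mu| >= eps), Chebyshev bound sigma^2/(n eps^2))."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    mu, var = 0.5, 1.0 / 12.0          # mean and variance of Uniform(0, 1)
    exceed = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - mu) >= eps:
            exceed += 1
    return exceed / trials, var / (n * eps ** 2)


for n in (10, 50, 200):
    empirical, bound = chebyshev_check(n, eps=0.1)
    print(n, empirical, bound)
```

Both columns shrink as $n$ grows; the empirical frequency is typically far below the bound, which is what one expects from an inequality that holds for every distribution with finite variance.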


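6.30. (a) Using the hint, for any $u$ and $v > 0$ we have

\[ u^2\nu_{k-1} + 2uv\,\nu_k + v^2\nu_{k+1} \ge 0. \]

Since $\nu_{k-1} \ge 0$ for $k \ge 1$, we must therefore have

\[ \nu_k^2 - \nu_{k-1}\nu_{k+1} \le 0, \qquad k \ge 1, \]

that is,

\[ \nu_k^2 \le \nu_{k-1}\nu_{k+1}, \qquad k = 1,2,\ldots,n-1. \]

(b) $\nu_0 = 1$. For $k = 1$, $\nu_1^2 \le \nu_2$. Hence, $\nu_1 \le \nu_2^{1/2}$. Let us now use
mathematical induction to show that $\nu_n^{1/n} \le \nu_{n+1}^{1/(n+1)}$ for all $n$ for
which $\nu_n$ and $\nu_{n+1}$ exist. The statement is true for $n = 1$. Suppose
that

\[ \nu_{n-1}^{1/(n-1)} \le \nu_n^{1/n}. \]

To show that $\nu_n^{1/n} \le \nu_{n+1}^{1/(n+1)}$: We have that

\[ \nu_n^2 \le \nu_{n-1}\nu_{n+1} \le \nu_n^{(n-1)/n}\,\nu_{n+1}. \]

Thus,

\[ \nu_n^{(n+1)/n} \le \nu_{n+1}. \]

Hence,

\[ \nu_n^{1/n} \le \nu_{n+1}^{1/(n+1)}. \]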
CHAPTER 7 


7.1. (a) Along $x_1 = 0$, we have $f(0, x_2) = 0$ for all $x_2$, and the limit is zero
as $\mathbf{x} \to \mathbf{0}$. Any line through the origin can be represented by the
equation $x_1 = t x_2$. Along this line, we have

\[ f(t x_2, x_2) = \frac{|t x_2|}{x_2^2}\exp\!\left(-\frac{|t x_2|}{x_2^2}\right) = \frac{|t|}{|x_2|}\exp\!\left(-\frac{|t|}{|x_2|}\right), \qquad x_2 \ne 0. \]

Using l'Hospital's rule (Theorem 4.2.6), $f(t x_2, x_2) \to 0$ as $x_2 \to 0$.

(b) Along the parabola $x_1 = x_2^2$, we have

\[ f(x_1, x_2) = \frac{x_2^2}{x_2^2}\exp\!\left(-\frac{x_2^2}{x_2^2}\right) = e^{-1}, \qquad x_2 \ne 0, \]

which does not go to zero as $x_2 \to 0$. By Definition 7.2.1, $f(x_1, x_2)$
does not have a limit as $\mathbf{x} \to \mathbf{0}$.


7.5. (a) This function has no limit as x — 0, as was shown in Example 7.2.2. 
Hence, it is not continuous at the origin. 


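(b)

\[ \left.\frac{\partial f(x_1,x_2)}{\partial x_1}\right|_{\mathbf{x}=\mathbf{0}} = \lim_{\Delta x_1 \to 0} \frac{1}{\Delta x_1}\left[\frac{0\cdot\Delta x_1}{(\Delta x_1)^2 + 0} - 0\right] = 0, \]
\[ \left.\frac{\partial f(x_1,x_2)}{\partial x_2}\right|_{\mathbf{x}=\mathbf{0}} = \lim_{\Delta x_2 \to 0} \frac{1}{\Delta x_2}\left[\frac{0\cdot\Delta x_2}{0 + (\Delta x_2)^2} - 0\right] = 0. \]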
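7.6. Applying formula (7.12), we get

\[ \frac{df}{dt} = \sum_{i=1}^k x_i\,\frac{\partial f(\mathbf{u})}{\partial u_i} = n t^{n-1} f(\mathbf{x}), \]

where $u_i$ is the $i$th element of $\mathbf{u} = t\mathbf{x}$, $i = 1,2,\ldots,k$. But, on one hand,

\[ \frac{\partial f(\mathbf{u})}{\partial x_i} = \sum_{j=1}^k \frac{\partial f(\mathbf{u})}{\partial u_j}\frac{\partial u_j}{\partial x_i} = t\,\frac{\partial f(\mathbf{u})}{\partial u_i}, \]

and on the other hand,

\[ \frac{\partial f(\mathbf{u})}{\partial x_i} = t^n\,\frac{\partial f(\mathbf{x})}{\partial x_i}. \]

Hence,

\[ \frac{\partial f(\mathbf{u})}{\partial u_i} = t^{n-1}\,\frac{\partial f(\mathbf{x})}{\partial x_i}. \]

It follows that

\[ \sum_{i=1}^k x_i\,\frac{\partial f(\mathbf{u})}{\partial u_i} = n t^{n-1} f(\mathbf{x}) \]

can be written as

\[ t^{n-1}\sum_{i=1}^k x_i\,\frac{\partial f(\mathbf{x})}{\partial x_i} = n t^{n-1} f(\mathbf{x}), \]

that is,

\[ \sum_{i=1}^k x_i\,\frac{\partial f(\mathbf{x})}{\partial x_i} = n f(\mathbf{x}). \]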


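7.7. (a) $f(x_1, x_2)$ is not continuous at the origin, since along $x_1 = 0$ and
$x_2 \ne 0$, $f(x_1, x_2) = 0$, which has a limit equal to zero as $x_2 \to 0$. But,
along the parabola $x_2 = x_1^2$,

\[ f(x_1, x_2) = \tfrac{1}{2} \quad \text{if } x_1 \ne 0. \]

Hence, $\lim_{x_1 \to 0} f(x_1, x_1^2) = \tfrac{1}{2} \ne 0$.

(b) The directional derivative in the direction of the unit vector $\mathbf{v} =
(v_1, v_2)'$ at the origin is given by

\[ v_1 \left.\frac{\partial f(\mathbf{x})}{\partial x_1}\right|_{\mathbf{x}=\mathbf{0}} + v_2 \left.\frac{\partial f(\mathbf{x})}{\partial x_2}\right|_{\mathbf{x}=\mathbf{0}}. \]

This derivative is equal to zero, since

\[ \left.\frac{\partial f(\mathbf{x})}{\partial x_1}\right|_{\mathbf{x}=\mathbf{0}} = \lim_{\Delta x_1 \to 0}\frac{1}{\Delta x_1}\left[\frac{(\Delta x_1)^2\cdot 0}{(\Delta x_1)^4 + 0} - 0\right] = 0, \]
\[ \left.\frac{\partial f(\mathbf{x})}{\partial x_2}\right|_{\mathbf{x}=\mathbf{0}} = \lim_{\Delta x_2 \to 0}\frac{1}{\Delta x_2}\left[\frac{0\cdot\Delta x_2}{0 + (\Delta x_2)^2} - 0\right] = 0. \]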


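7.8. The directional derivative of $f$ at a point $\mathbf{x}$ on $C$ in the direction of $\mathbf{v}$ is
$\sum_{i=1}^k v_i\,\partial f(\mathbf{x})/\partial x_i$. But $\mathbf{v}$ is given by

\[ \mathbf{v} = \frac{d\mathbf{x}}{dt}\Big/\left\|\frac{d\mathbf{x}}{dt}\right\|_2 = \frac{d\mathbf{g}}{dt}\Big/\left\|\frac{d\mathbf{g}}{dt}\right\|_2, \]

where $\mathbf{g}$ is the vector whose $i$th element is $g_i(t)$. Hence,

\[ \sum_{i=1}^k v_i\,\frac{\partial f(\mathbf{x})}{\partial x_i} = \frac{1}{\|d\mathbf{g}/dt\|_2}\sum_{i=1}^k \frac{dg_i(t)}{dt}\frac{\partial f(\mathbf{x})}{\partial x_i}. \]

Furthermore,

\[ \frac{ds}{dt} = \left\|\frac{d\mathbf{g}}{dt}\right\|_2. \]

It follows that

\[ \sum_{i=1}^k v_i\,\frac{\partial f(\mathbf{x})}{\partial x_i} = \frac{dt}{ds}\sum_{i=1}^k \frac{dg_i(t)}{dt}\frac{\partial f(\mathbf{x})}{\partial x_i} = \sum_{i=1}^k \frac{dx_i}{ds}\frac{\partial f(\mathbf{x})}{\partial x_i} = \frac{df(\mathbf{x})}{ds}. \]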
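7.9. (a)

\[ f(x_1, x_2) = f(0,0) + \left(x_1\frac{\partial}{\partial x_1} + x_2\frac{\partial}{\partial x_2}\right) f(0,0) + \frac{1}{2!}\left(x_1^2\frac{\partial^2}{\partial x_1^2} + 2x_1x_2\frac{\partial^2}{\partial x_1\partial x_2} + x_2^2\frac{\partial^2}{\partial x_2^2}\right) f(0,0) = 1 + x_1x_2. \]

(b)

\[ f(x_1, x_2, x_3) = f(0,0,0) + \sum_{i=1}^3 x_i\frac{\partial}{\partial x_i} f(0,0,0) + \frac{1}{2!}\left(\sum_{i=1}^3 x_i\frac{\partial}{\partial x_i}\right)^{\!2} f(0,0,0) \]
\[ \phantom{f(x_1, x_2, x_3)} = \sin(1) + x_1\cos(1) + \frac{1}{2!}\left\{x_1^2[\cos(1) - \sin(1)] + 2x_2^2\cos(1)\right\}. \]

(c)

\[ f(x_1, x_2) = f(0,0) + \left(x_1\frac{\partial}{\partial x_1} + x_2\frac{\partial}{\partial x_2}\right) f(0,0) + \frac{1}{2!}\left(x_1^2\frac{\partial^2}{\partial x_1^2} + 2x_1x_2\frac{\partial^2}{\partial x_1\partial x_2} + x_2^2\frac{\partial^2}{\partial x_2^2}\right) f(0,0) = f(0,0), \]

since all first-order and second-order partial derivatives vanish at
$\mathbf{x} = \mathbf{0}$.

7.10. (a) If $u_1 = f(x_1, x_2)$ and $\partial f/\partial x_1 \ne 0$ at $\mathbf{x}_0$, then by the implicit
function theorem (Theorem 7.6.2), there is a neighborhood of $\mathbf{x}_0$ in
which the equation $u_1 = f(x_1, x_2)$ can be solved uniquely for $x_1$ in
terms of $x_2$ and $u_1$, that is, $x_1 = h(u_1, x_2)$. Thus,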


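\[ f[h(u_1, x_2), x_2] \equiv u_1. \]

Hence, by differentiating this identity with respect to $x_2$, we get

\[ \frac{\partial f}{\partial x_1}\frac{\partial h}{\partial x_2} + \frac{\partial f}{\partial x_2} = 0, \]

which gives

\[ \frac{\partial h}{\partial x_2} = -\frac{\partial f}{\partial x_2}\Big/\frac{\partial f}{\partial x_1} \]

in a neighborhood of $\mathbf{x}_0$.

(b) On the basis of part (a), we can consider $g(x_1, x_2)$ as a function of
$x_2$ and $u_1$, since $x_1 = h(u_1, x_2)$ in a neighborhood of $\mathbf{x}_0$. In such a
neighborhood, the partial derivative of $g$ with respect to $x_2$ is

\[ \frac{\partial g}{\partial x_1}\frac{\partial h}{\partial x_2} + \frac{\partial g}{\partial x_2} = \frac{\partial(f,g)}{\partial(x_1,x_2)}\Big/\frac{\partial f}{\partial x_1}, \]

which is equal to zero because $\partial(f,g)/\partial(x_1,x_2) = 0$. [Recall that
$\partial h/\partial x_2 = -(\partial f/\partial x_2)/(\partial f/\partial x_1)$.]

(c) From part (b), $g[h(u_1, x_2), x_2]$ is independent of $x_2$ in a neighborhood
of $\mathbf{x}_0$. We can then write $\phi(u_1) = g[h(u_1, x_2), x_2]$. Since
$x_1 = h(u_1, x_2)$ is equivalent to $u_1 = f(x_1, x_2)$ in a neighborhood of
$\mathbf{x}_0$, $\phi[f(x_1, x_2)] = g(x_1, x_2)$ in this neighborhood.

(d) Let $G(f, g) = g(x_1, x_2) - \phi[f(x_1, x_2)]$. Then, in a neighborhood of
$\mathbf{x}_0$, $G(f, g) = 0$. Hence,

\[ \frac{\partial G}{\partial f}\frac{\partial f}{\partial x_1} + \frac{\partial G}{\partial g}\frac{\partial g}{\partial x_1} = 0, \]
\[ \frac{\partial G}{\partial f}\frac{\partial f}{\partial x_2} + \frac{\partial G}{\partial g}\frac{\partial g}{\partial x_2} = 0. \]

In order for these two identities to be satisfied by values of
$\partial G/\partial f$, $\partial G/\partial g$ not all zero, it is necessary that $\partial(f,g)/\partial(x_1,x_2)$
be equal to zero in this neighborhood of $\mathbf{x}_0$.

7.11. $u(x_1, x_2, x_3) = u(\xi_1\xi_3, \xi_2\xi_3, \xi_3)$. Hence,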


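\[ \frac{\partial u}{\partial \xi_3} = \frac{\partial u}{\partial x_1}\frac{\partial x_1}{\partial \xi_3} + \frac{\partial u}{\partial x_2}\frac{\partial x_2}{\partial \xi_3} + \frac{\partial u}{\partial x_3}\frac{\partial x_3}{\partial \xi_3}
 = \xi_1\frac{\partial u}{\partial x_1} + \xi_2\frac{\partial u}{\partial x_2} + \frac{\partial u}{\partial x_3}, \]

so that

\[ \xi_3\frac{\partial u}{\partial \xi_3} = \xi_1\xi_3\frac{\partial u}{\partial x_1} + \xi_2\xi_3\frac{\partial u}{\partial x_2} + \xi_3\frac{\partial u}{\partial x_3}
 = x_1\frac{\partial u}{\partial x_1} + x_2\frac{\partial u}{\partial x_2} + x_3\frac{\partial u}{\partial x_3} = nu. \]

Integrating this partial differential equation with respect to $\xi_3$, we get

\[ \log u = n\log \xi_3 + \phi(\xi_1, \xi_2), \]

where $\phi(\xi_1, \xi_2)$ is a function of $\xi_1$, $\xi_2$. Hence,

\[ u = \xi_3^n \exp[\phi(\xi_1, \xi_2)] = x_3^n F(\xi_1, \xi_2), \]

where $F(\xi_1, \xi_2) = \exp[\phi(\xi_1, \xi_2)]$.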


7.13. (a) The Jacobian determinant is $27x_1^2x_2^2x_3^2$, which is zero in any subset
of $R^3$ that contains points on any of the coordinate planes.

(b) The unique inverse function is given by

\[ x_1 = u_1^{1/3}, \qquad x_2 = u_2^{1/3}, \qquad x_3 = u_3^{1/3}. \]


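7.14. Since $\partial(g_1, g_2)/\partial(x_1, x_2) \ne 0$, we can solve for $x_1$, $x_2$ uniquely in terms
of $y_1$ and $y_2$ by Theorem 7.6.2. Differentiating $g_1$ and $g_2$ with respect
to $y_1$, we obtain

\[ \frac{\partial g_1}{\partial x_1}\frac{\partial x_1}{\partial y_1} + \frac{\partial g_1}{\partial x_2}\frac{\partial x_2}{\partial y_1} + \frac{\partial g_1}{\partial y_1} = 0, \]
\[ \frac{\partial g_2}{\partial x_1}\frac{\partial x_1}{\partial y_1} + \frac{\partial g_2}{\partial x_2}\frac{\partial x_2}{\partial y_1} + \frac{\partial g_2}{\partial y_1} = 0. \]

Solving for $\partial x_1/\partial y_1$ and $\partial x_2/\partial y_1$ yields the desired result.

7.15. This is similar to Exercise 7.14. Since $\partial(f, g)/\partial(x_1, x_2) \ne 0$, we can
solve for $x_1$, $x_2$ in terms of $x_3$, and we get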


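\[ \frac{dx_1}{dx_3} = -\frac{\partial(f,g)}{\partial(x_3,x_2)}\Big/\frac{\partial(f,g)}{\partial(x_1,x_2)}, \qquad
   \frac{dx_2}{dx_3} = -\frac{\partial(f,g)}{\partial(x_1,x_3)}\Big/\frac{\partial(f,g)}{\partial(x_1,x_2)}. \]

But

\[ \frac{\partial(f,g)}{\partial(x_3,x_2)} = -\frac{\partial(f,g)}{\partial(x_2,x_3)}, \qquad \frac{\partial(f,g)}{\partial(x_1,x_3)} = -\frac{\partial(f,g)}{\partial(x_3,x_1)}. \]

Hence,

\[ \frac{dx_1}{\partial(f,g)/\partial(x_2,x_3)} = \frac{dx_2}{\partial(f,g)/\partial(x_3,x_1)} = \frac{dx_3}{\partial(f,g)/\partial(x_1,x_2)}. \]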


7.16. (a) (— 4, — 4) is a point of local minimum. 


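(b)

\[ \frac{\partial f}{\partial x_1} = 4\alpha x_1 - x_2 + 1 = 0, \]
\[ \frac{\partial f}{\partial x_2} = -x_1 + 2x_2 - 1 = 0. \]

The solution of these two equations is $x_1 = 1/(1 - 8\alpha)$, $x_2 = (4\alpha - 1)/(8\alpha - 1)$.
The Hessian matrix of $f$ is

\[ A = \begin{bmatrix} 4\alpha & -1 \\ -1 & 2 \end{bmatrix}. \]

Here, $f_{11} = 4\alpha$, and $\det(A) = 8\alpha - 1$.
i. Yes, if $\alpha > \tfrac{1}{8}$.
ii. No, it is not possible.
iii. Yes, if $\alpha < \tfrac{1}{8}$, $\alpha \ne 0$.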
(c) The stationary points are (—2, —2), (4,4). The first is a saddle 
point, and the second is a point of local minimum. 
(d) The stationary points are $(\sqrt{2}, -\sqrt{2})$, $(-\sqrt{2}, \sqrt{2})$, and $(0,0)$. The
first two are points of local minimum. At $(0,0)$, $f_{11}f_{22} - f_{12}^2 = 0$. In
this case, $\mathbf{h}'A\mathbf{h}$ has the same sign as $f_{11}(0,0) = -4$ for all values of
$\mathbf{h} = (h_1, h_2)'$, except when $h_1 - h_2 = 0$, in which case $\mathbf{h}'A\mathbf{h} = 0$.
For such values of $\mathbf{h}$, $(\mathbf{h}'\nabla)^3 f(0,0) = 0$, but
$(\mathbf{h}'\nabla)^4 f(0,0) = 24(h_1^4 + h_2^4) = 48h_1^4$, which is
nonnegative. Hence, the point $(0,0)$ is a saddle point.
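The stationary point quoted in 7.16(b) can be verified with exact rational arithmetic. The sketch below uses only the two gradient equations and the determinant $8\alpha - 1$ stated in the solution; the function names and the particular values of $\alpha$ are illustrative assumptions.

```python
from fractions import Fraction


def stationary_point(alpha):
    """Closed-form stationary point from 7.16(b); requires alpha != 1/8."""
    alpha = Fraction(alpha)
    x1 = 1 / (1 - 8 * alpha)
    x2 = (4 * alpha - 1) / (8 * alpha - 1)
    return x1, x2


def gradient(alpha, x1, x2):
    """Left-hand sides of the two first-order conditions."""
    alpha = Fraction(alpha)
    return 4 * alpha * x1 - x2 + 1, -x1 + 2 * x2 - 1


def classify(alpha):
    """Second-derivative test based on f11 = 4*alpha and det(A) = 8*alpha - 1."""
    alpha = Fraction(alpha)
    det = 8 * alpha - 1
    if det < 0:
        return "saddle point"
    if det > 0:
        # det > 0 forces alpha > 1/8, so f11 = 4*alpha is positive.
        return "local minimum"
    return "undetermined"


for alpha in (Fraction(1, 2), Fraction(1, 16), Fraction(-1)):
    x1, x2 = stationary_point(alpha)
    print(alpha, (x1, x2), gradient(alpha, x1, x2), classify(alpha))
```

Because $\det(A) > 0$ already implies $f_{11} > 0$, the test never returns a local maximum, which matches answer ii.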


7.19. The Lagrange equations are 


\[ (2 + 8\lambda)x_1 + 12x_2 = 0, \]
\[ 12x_1 + (4 + 2\lambda)x_2 = 0, \]
\[ 4x_1^2 + x_2^2 = 25. \]

We must have

\[ (4\lambda + 1)(\lambda + 2) - 36 = 0. \]

The solutions are $\lambda_1 = -4.25$, $\lambda_2 = 2$. For $\lambda_1 = -4.25$, we have the
points
i. $x_1 = 1.5$, $x_2 = 4.0$,
ii. $x_1 = -1.5$, $x_2 = -4.0$.
For $\lambda_2 = 2$, we have the points
iii. $x_1 = 2$, $x_2 = -3$,
iv. $x_1 = -2$, $x_2 = 3$.
The matrix $B_1$ has the value

\[ B_1 = \begin{bmatrix} 2 + 8\lambda & 12 & 8x_1 \\ 12 & 4 + 2\lambda & 2x_2 \\ 8x_1 & 2x_2 & 0 \end{bmatrix}. \]

The determinant of $B_1$ is $\Delta_1$.
At (i), $\Delta_1 = 5000$; the point is a local maximum.
At (ii), $\Delta_1 = 5000$; the point is a local maximum.
At (iii) and (iv), $\Delta_1 = -5000$; the points are local minima.
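The four candidate points and the determinants quoted above are easy to confirm by direct substitution. The following sketch is only a check of the stated numbers; the helper names `det3` and `bordered` are hypothetical.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists (cofactor expansion)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def bordered(lam, x1, x2):
    """The matrix B_1 of Exercise 7.19 evaluated at (lam, x1, x2)."""
    return [
        [2 + 8 * lam, 12, 8 * x1],
        [12, 4 + 2 * lam, 2 * x2],
        [8 * x1, 2 * x2, 0],
    ]


def residuals(lam, x1, x2):
    """Left-hand sides of the Lagrange equations, with the constraint moved to zero."""
    return (
        (2 + 8 * lam) * x1 + 12 * x2,
        12 * x1 + (4 + 2 * lam) * x2,
        4 * x1 ** 2 + x2 ** 2 - 25,
    )


candidates = [(-4.25, 1.5, 4.0), (-4.25, -1.5, -4.0), (2.0, 2.0, -3.0), (2.0, -2.0, 3.0)]
determinants = [det3(bordered(*c)) for c in candidates]
for c, d in zip(candidates, determinants):
    print(c, residuals(*c), d)
```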
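7.21. Let $F = x_1^2x_2^2x_3^2 + \lambda(x_1^2 + x_2^2 + x_3^2 - c^2)$. Then

\[ \frac{\partial F}{\partial x_1} = 2x_1x_2^2x_3^2 + 2\lambda x_1 = 0, \]
\[ \frac{\partial F}{\partial x_2} = 2x_1^2x_2x_3^2 + 2\lambda x_2 = 0, \]
\[ \frac{\partial F}{\partial x_3} = 2x_1^2x_2^2x_3 + 2\lambda x_3 = 0, \]
\[ x_1^2 + x_2^2 + x_3^2 = c^2. \]

Since $x_1$, $x_2$, $x_3$ cannot be equal to zero, we must have

\[ x_2^2x_3^2 + \lambda = 0, \qquad x_1^2x_3^2 + \lambda = 0, \qquad x_1^2x_2^2 + \lambda = 0. \]

These equations imply that $x_1^2 = x_2^2 = x_3^2 = c^2/3$ and $\lambda = -c^4/9$. For
these values, it can be verified that $\Delta_1 < 0$, $\Delta_2 > 0$, where $\Delta_1$ and $\Delta_2$
are the determinants of $B_1$ and $B_2$, and

\[ B_1 = \begin{bmatrix}
2x_2^2x_3^2 + 2\lambda & 4x_1x_2x_3^2 & 4x_1x_2^2x_3 & 2x_1 \\
4x_1x_2x_3^2 & 2x_1^2x_3^2 + 2\lambda & 4x_1^2x_2x_3 & 2x_2 \\
4x_1x_2^2x_3 & 4x_1^2x_2x_3 & 2x_1^2x_2^2 + 2\lambda & 2x_3 \\
2x_1 & 2x_2 & 2x_3 & 0
\end{bmatrix}, \]

\[ B_2 = \begin{bmatrix}
2x_1^2x_3^2 + 2\lambda & 4x_1^2x_2x_3 & 2x_2 \\
4x_1^2x_2x_3 & 2x_1^2x_2^2 + 2\lambda & 2x_3 \\
2x_2 & 2x_3 & 0
\end{bmatrix}. \]

Since $n = 3$ is odd, the function $x_1^2x_2^2x_3^2$ must attain a maximum value
given by $(c^2/3)^3$. It follows that for all values of $x_1$, $x_2$, $x_3$,

\[ x_1^2x_2^2x_3^2 \le \left(\frac{c^2}{3}\right)^{3}, \]

that is,

\[ (x_1^2x_2^2x_3^2)^{1/3} \le \frac{x_1^2 + x_2^2 + x_3^2}{3}. \]

7.24. The domain of integration, $D$, is the region bounded from above by the
parabola $x_2 = 4x_1 - x_1^2$ and from below by the parabola $x_2 = x_1^2$. The
two parabolas intersect at $(0,0)$ and $(2,4)$. It is then easy to see that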


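\[ \int_0^4\left[\int_{2-\sqrt{4-x_2}}^{\sqrt{x_2}} f(x_1, x_2)\,dx_1\right] dx_2 = \int_0^2\left[\int_{x_1^2}^{4x_1 - x_1^2} f(x_1, x_2)\,dx_2\right] dx_1. \]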


7.25. (a) I= fdl Pies See ree ier 


b) dg/dx, = jii Poe) 


ae dx, — 2x, f(x,,1 —x7) + f(x 1 = x1). 
1 


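7.26. Let $u = x_2^2/x_1$, $v = x_1^2/x_2$. Then $uv = x_1x_2$, and

\[ \iint_D x_1x_2\,dx_1\,dx_2 = \iint_{D'} uv\left|\frac{\partial(x_1,x_2)}{\partial(u,v)}\right| du\,dv, \]

where $D'$ denotes the image of $D$ under this transformation. But

\[ \left|\frac{\partial(x_1,x_2)}{\partial(u,v)}\right| = \left|\frac{\partial(u,v)}{\partial(x_1,x_2)}\right|^{-1} = \frac{1}{3}. \]

Hence,

\[ \iint_D x_1x_2\,dx_1\,dx_2 = \frac{1}{3}\iint_{D'} uv\,du\,dv. \]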


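7.28. Let $I(a) = \int_0^{\sqrt{3}} dx/(a + x^2)$. Then

\[ \frac{d^2I(a)}{da^2} = 2\int_0^{\sqrt{3}} \frac{dx}{(a + x^2)^3}. \]

On the other hand,

\[ I(a) = \frac{1}{\sqrt{a}}\,\mathrm{Arctan}\sqrt{\frac{3}{a}}, \]

\[ \frac{d^2I(a)}{da^2} = \frac{3}{4}\,a^{-5/2}\,\mathrm{Arctan}\sqrt{\frac{3}{a}} + \frac{\sqrt{3}}{4}\,\frac{1}{a^2(a+3)} + \frac{\sqrt{3}}{2}\,\frac{2a+3}{a^2(a+3)^2}. \]

Thus

\[ \int_0^{\sqrt{3}} \frac{dx}{(a + x^2)^3} = \frac{3}{8}\,a^{-5/2}\,\mathrm{Arctan}\sqrt{\frac{3}{a}} + \frac{\sqrt{3}}{8}\,\frac{1}{a^2(a+3)} + \frac{\sqrt{3}}{4}\,\frac{2a+3}{a^2(a+3)^2}. \]

Putting $a = 1$, we obtain

\[ \int_0^{\sqrt{3}} \frac{dx}{(1 + x^2)^3} = \frac{3}{8}\,\mathrm{Arctan}\sqrt{3} + \frac{7\sqrt{3}}{64}. \]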
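The closed form obtained by differentiating under the integral sign can be compared with a direct numerical quadrature. This is a minimal sketch; the Simpson helper and the function names are illustrative assumptions, not part of the text.

```python
import math


def closed_form(a):
    """Value of the integral of (a + x^2)^(-3) over [0, sqrt(3)] from Exercise 7.28."""
    root3 = math.sqrt(3.0)
    return (
        3.0 / 8.0 * a ** -2.5 * math.atan(math.sqrt(3.0 / a))
        + root3 / 8.0 / (a ** 2 * (a + 3.0))
        + root3 / 4.0 * (2.0 * a + 3.0) / (a ** 2 * (a + 3.0) ** 2)
    )


def simpson(func, lo, hi, panels=2000):
    """Composite Simpson rule with an even number of panels."""
    h = (hi - lo) / panels
    total = func(lo) + func(hi)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3


def numeric(a):
    return simpson(lambda x: (a + x * x) ** -3, 0.0, math.sqrt(3.0))


for a in (0.5, 1.0, 2.0):
    print(a, closed_form(a), numeric(a))
```

For $a = 1$ the closed form reduces to $\pi/8 + 7\sqrt{3}/64$, since $\mathrm{Arctan}\sqrt{3} = \pi/3$.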


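7.29. (a) The marginal density functions of $X_1$ and $X_2$ are

\[ f_1(x_1) = \int_0^1 (x_1 + x_2)\,dx_2 = x_1 + \tfrac{1}{2}, \qquad 0 < x_1 < 1, \]
\[ f_2(x_2) = \int_0^1 (x_1 + x_2)\,dx_1 = x_2 + \tfrac{1}{2}, \qquad 0 < x_2 < 1, \]
\[ f(x_1, x_2) \ne f_1(x_1) f_2(x_2). \]

The random variables $X_1$ and $X_2$ are therefore not independent.

(b)

\[ E(X_1X_2) = \int_0^1\!\!\int_0^1 x_1x_2(x_1 + x_2)\,dx_1\,dx_2 = \frac{1}{3}. \]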


7.30. The marginal densities of $X_1$ and $X_2$ are

\[ f_1(x_1) = \begin{cases} 1 + x_1, & -1 < x_1 \le 0, \\ 1 - x_1, & 0 \le x_1 < 1, \\ 0, & \text{otherwise}, \end{cases} \]

\[ f_2(x_2) = 2x_2, \qquad 0 < x_2 < 1. \]

Note that $f(x_1, x_2) \ne f_1(x_1) f_2(x_2)$. Hence, $X_1$ and $X_2$ are not independent.
But

\[ E(X_1X_2) = \int_0^1\left[\int_{-x_2}^{x_2} x_1x_2\,dx_1\right] dx_2 = 0, \]
\[ E(X_1) = \int_{-1}^0 x_1(1 + x_1)\,dx_1 + \int_0^1 x_1(1 - x_1)\,dx_1 = 0, \]
\[ E(X_2) = \int_0^1 2x_2^2\,dx_2 = \frac{2}{3}. \]

Hence,

\[ E(X_1X_2) = E(X_1)E(X_2) = 0. \]


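7.31. (a) The joint density function of $Y_1$ and $Y_2$ is

\[ g(y_1, y_2) = \frac{1}{\Gamma(\alpha)\Gamma(\beta)}\,y_1^{\alpha-1}(1 - y_1)^{\beta-1}\,y_2^{\alpha+\beta-1}e^{-y_2}, \qquad 0 < y_1 < 1,\ 0 < y_2 < \infty. \]

(b) The marginal densities of $Y_1$ and $Y_2$ are, respectively,

\[ g_1(y_1) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,y_1^{\alpha-1}(1 - y_1)^{\beta-1}, \qquad 0 < y_1 < 1, \]
\[ g_2(y_2) = \frac{B(\alpha, \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,y_2^{\alpha+\beta-1}e^{-y_2}, \qquad 0 < y_2 < \infty. \]

(c) Since $B(\alpha, \beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha + \beta)$,

\[ g(y_1, y_2) = g_1(y_1)\,g_2(y_2), \]

and $Y_1$ and $Y_2$ are therefore independent.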


7.32. Let $U = X_1$, $W = X_1X_2$. Then, $X_1 = U$, $X_2 = W/U$. The joint density
function of $U$ and $W$ is

\[ g(u, w) = \frac{10w^2}{u^2}, \qquad w < u < \sqrt{w},\ 0 < w < 1. \]

The marginal density function of $W$ is

\[ g_1(w) = 10w^2\int_w^{\sqrt{w}} \frac{du}{u^2} = 10w(1 - \sqrt{w}), \qquad 0 < w < 1. \]

7.34. The inverse transformation is given by

\[ X_1 = Y_1, \quad X_2 = Y_2 - Y_1, \quad \ldots, \quad X_n = Y_n - Y_{n-1}, \]

and the absolute value of the Jacobian determinant of this transformation
is 1. Therefore, the joint density function of the $Y_i$'s is

\[ g(y_1, y_2, \ldots, y_n) = e^{-y_1}e^{-(y_2 - y_1)}\cdots e^{-(y_n - y_{n-1})} = e^{-y_n}, \qquad 0 < y_1 < y_2 < \cdots < y_n < \infty. \]

The marginal density function of $Y_n$ is

\[ g_n(y_n) = e^{-y_n}\int_0^{y_n}\!\!\int_0^{y_{n-1}}\!\!\cdots\int_0^{y_2} dy_1\,dy_2\cdots dy_{n-1} = \frac{y_n^{n-1}}{(n-1)!}\,e^{-y_n}, \qquad 0 < y_n < \infty. \]

7.37. Let $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)'$. The least-squares estimate of $\boldsymbol{\beta}$ is given by

\[ \hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{y}, \]

where $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$, and $X$ is a matrix of order $n \times 3$ whose first
column is the column of ones, and whose second and third columns are
given by the values of $x_i$, $x_i^2$, $i = 1,2,\ldots,n$.

7.38. Maximizing $L(\mathbf{x}, \mathbf{p})$ is equivalent to maximizing its natural logarithm.
Let us therefore consider maximizing $\log L$ subject to $\sum_{i=1}^k p_i = 1$.
Using the method of Lagrange multipliers, let

\[ F = \log L + \lambda\left(\sum_{i=1}^k p_i - 1\right). \]

Differentiating $F$ with respect to $p_1, p_2, \ldots, p_k$, and equating the
derivatives to zero, we obtain $x_i/p_i + \lambda = 0$ for $i = 1,2,\ldots,k$. Combining
these equations with the constraint $\sum_{i=1}^k p_i = 1$, we obtain the
maximum likelihood estimates

\[ \hat{p}_i = \frac{x_i}{n}, \qquad i = 1,2,\ldots,k, \qquad \text{where } n = \sum_{i=1}^k x_i. \]

CHAPTER 8

8.1. Let $\mathbf{x}_i = (x_{i1}, x_{i2})'$, $\mathbf{h}_i = (h_{i1}, h_{i2})'$, $i = 0,1,2,\ldots$. Then
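\[ f(\mathbf{x}_i + t\mathbf{h}_i) = 8(x_{i1} + th_{i1})^2 - 4(x_{i1} + th_{i1})(x_{i2} + th_{i2}) + 5(x_{i2} + th_{i2})^2 = a_it^2 + b_it + c_i, \]

where

\[ a_i = 8h_{i1}^2 - 4h_{i1}h_{i2} + 5h_{i2}^2, \]
\[ b_i = 16x_{i1}h_{i1} - 4(x_{i1}h_{i2} + x_{i2}h_{i1}) + 10x_{i2}h_{i2}, \]
\[ c_i = 8x_{i1}^2 - 4x_{i1}x_{i2} + 5x_{i2}^2. \]

The value of $t$ that minimizes $f(\mathbf{x}_i + t\mathbf{h}_i)$ in the direction of $\mathbf{h}_i$ is given
by the solution of $\partial f/\partial t = 0$, namely, $t_i = -b_i/(2a_i)$. Hence,

\[ \mathbf{x}_{i+1} = \mathbf{x}_i - \frac{b_i}{2a_i}\,\mathbf{h}_i, \qquad i = 0,1,2,\ldots, \]

where

\[ \mathbf{h}_i = -\frac{\nabla f(\mathbf{x}_i)}{\|\nabla f(\mathbf{x}_i)\|_2}, \qquad i = 0,1,2,\ldots, \]

and

\[ \nabla f(\mathbf{x}_i) = (16x_{i1} - 4x_{i2},\ -4x_{i1} + 10x_{i2})', \qquad i = 0,1,2,\ldots. \]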
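The iteration defined by these formulas can be traced in a few lines of code. The sketch below is an illustrative implementation of the update rule with exact line search; the starting point $(5, 2)'$ is the one used in the solution's table, while the function names are assumptions made for this example.

```python
import math


def f(x):
    """Objective function of Exercise 8.1."""
    return 8 * x[0] ** 2 - 4 * x[0] * x[1] + 5 * x[1] ** 2


def grad(x):
    """Gradient of f."""
    return (16 * x[0] - 4 * x[1], -4 * x[0] + 10 * x[1])


def steepest_descent(x0, iterations):
    """Return rows (i, x_i, h_i, t_i, a_i, f(x_i)) of the steepest descent iteration."""
    x = tuple(x0)
    rows = []
    for i in range(iterations):
        g = grad(x)
        norm = math.hypot(g[0], g[1])
        if norm == 0.0:
            break  # already at the stationary point
        h = (-g[0] / norm, -g[1] / norm)          # unit steepest descent direction
        a = 8 * h[0] ** 2 - 4 * h[0] * h[1] + 5 * h[1] ** 2
        b = 16 * x[0] * h[0] - 4 * (x[0] * h[1] + x[1] * h[0]) + 10 * x[1] * h[1]
        t = -b / (2 * a)                          # exact minimizer along h
        rows.append((i, x, h, t, a, f(x)))
        x = (x[0] + t * h[0], x[1] + t * h[1])
    return rows


rows = steepest_descent((5.0, 2.0), 7)
for i, x, h, t, a, fx in rows:
    print(i, x, h, round(t, 6), round(a, 6), round(fx, 6))
```

Each pass reduces $f$ by a factor of 10, and the search directions alternate between the two coordinate axes.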


The results of the iterative minimization procedure are given in the
following table:

Iteration i   x_i              h_i        t_i      a_i   f(x_i)
0             (5, 2)'          (-1, 0)'   4.5      8     180
1             (0.5, 2)'        (0, -1)'   1.8      5     18
2             (0.5, 0.2)'      (-1, 0)'   0.45     8     1.8
3             (0.05, 0.2)'     (0, -1)'   0.18     5     0.18
4             (0.05, 0.02)'    (-1, 0)'   0.045    8     0.018
5             (0.005, 0.02)'   (0, -1)'   0.018    5     0.0018
6             (0.005, 0.002)'  (-1, 0)'   0.0045   8     0.00018

Note that $a_i > 0$, which confirms that $t_i = -b_i/(2a_i)$ does indeed
minimize $f(\mathbf{x}_i + t\mathbf{h}_i)$ in the direction of $\mathbf{h}_i$. It is obvious that if we were
to continue with this iterative procedure, the point $\mathbf{x}_i$ would converge
to $\mathbf{0}$.

8.3. (a)

\[ \hat{y}(\mathbf{x}) = 45.690 + 4.919x_1 + 8.847x_2 - 0.270x_1x_2 - 4.148x_1^2 - 4.298x_2^2. \]

(b) The matrix $\hat{B}$ is

\[ \hat{B} = \begin{bmatrix} -4.148 & -0.135 \\ -0.135 & -4.298 \end{bmatrix}. \]

Its eigenvalues are $\tau_1 = -8.136$, $\tau_2 = -8.754$. This makes $\hat{B}$ a
negative definite matrix. The results of applying the method of
ridge analysis inside the region $R$ are given in the following table:

lambda     x_1      x_2      r        y-hat
31.4126    0.0687   0.1236   0.1414   47.0340
13.5236    0.1373   0.2473   0.2829   48.2030
 7.5549    0.2059   0.3709   0.4242   49.1969
 4.5694    0.2745   0.4946   0.5657   50.0158
 2.7797    0.3430   0.6184   0.7072   50.6595
 1.5873    0.4114   0.7421   0.8485   51.1282
 0.7359    0.4797   0.8659   0.9899   51.4218
 0.0967    0.5480   0.9898   1.1314   51.5404
-0.4009    0.6163   1.1137   1.2729   51.4839
-0.7982    0.6844   1.2376   1.4142   51.2523

Note that the stationary point, $\mathbf{x}_0 = (0.5601, 1.0117)'$, is a point of
absolute maximum since the eigenvalues of $\hat{B}$ are negative. This
point falls inside $R$. The corresponding maximum value of $\hat{y}$ is
51.5431.

8.4. We know that


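\[ (\hat{B} - \lambda_1I)\mathbf{x}_1 = -\tfrac{1}{2}\hat{\boldsymbol{\beta}}, \]
\[ (\hat{B} - \lambda_2I)\mathbf{x}_2 = -\tfrac{1}{2}\hat{\boldsymbol{\beta}}, \]

where $\mathbf{x}_1$ and $\mathbf{x}_2$ correspond to $\lambda_1$ and $\lambda_2$, respectively, and are such
that $\mathbf{x}_1'\mathbf{x}_1 = \mathbf{x}_2'\mathbf{x}_2 = r^2$, with $r^2$ being the common value of $r_1^2$
and $r_2^2$. The corresponding values of $\hat{y}$ are

\[ \hat{y}_1 = \hat{\beta}_0 + \mathbf{x}_1'\hat{\boldsymbol{\beta}} + \mathbf{x}_1'\hat{B}\mathbf{x}_1, \]
\[ \hat{y}_2 = \hat{\beta}_0 + \mathbf{x}_2'\hat{\boldsymbol{\beta}} + \mathbf{x}_2'\hat{B}\mathbf{x}_2. \]

We then have

\[ \mathbf{x}_1'\hat{B}\mathbf{x}_1 - \mathbf{x}_2'\hat{B}\mathbf{x}_2 + \tfrac{1}{2}(\mathbf{x}_1' - \mathbf{x}_2')\hat{\boldsymbol{\beta}} = (\lambda_1 - \lambda_2)r^2, \]
\[ \hat{y}_1 - \hat{y}_2 = \tfrac{1}{2}(\mathbf{x}_1' - \mathbf{x}_2')\hat{\boldsymbol{\beta}} + (\lambda_1 - \lambda_2)r^2. \]

Furthermore, from the equations defining $\mathbf{x}_1$ and $\mathbf{x}_2$, we have

\[ (\lambda_2 - \lambda_1)\mathbf{x}_1'\mathbf{x}_2 = \tfrac{1}{2}(\mathbf{x}_1' - \mathbf{x}_2')\hat{\boldsymbol{\beta}}. \]

Hence,

\[ \hat{y}_1 - \hat{y}_2 = (\lambda_1 - \lambda_2)(r^2 - \mathbf{x}_1'\mathbf{x}_2). \]

But $r^2 = \|\mathbf{x}_1\|_2\|\mathbf{x}_2\|_2 > \mathbf{x}_1'\mathbf{x}_2$, since $\mathbf{x}_1$ and $\mathbf{x}_2$ are not parallel vectors.
We conclude that $\hat{y}_1 > \hat{y}_2$ whenever $\lambda_1 > \lambda_2$.

8.5. If $M(\mathbf{x}_1)$ is positive definite, then $\lambda_1$ must be smaller than all the
eigenvalues of $\hat{B}$ (this is based on applying the spectral decomposition
theorem to $\hat{B}$ and using Theorem 2.3.12). Similarly, if $M(\mathbf{x}_2)$ is indefinite,
then $\lambda_2$ must be larger than the smallest eigenvalue of $\hat{B}$ and also
smaller than its largest eigenvalue. It follows that $\lambda_1 < \lambda_2$. Hence, by
Exercise 8.4, $\hat{y}_1 < \hat{y}_2$.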


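8.6. (a) We know that

\[ (\hat{B} - \lambda I)\mathbf{x} = -\tfrac{1}{2}\hat{\boldsymbol{\beta}}. \]

Differentiating with respect to $\lambda$ gives

\[ (\hat{B} - \lambda I)\frac{\partial\mathbf{x}}{\partial\lambda} = \mathbf{x}, \]

and since $\mathbf{x}'\mathbf{x} = r^2$,

\[ \mathbf{x}'\frac{\partial\mathbf{x}}{\partial\lambda} = r\frac{\partial r}{\partial\lambda}. \]

A second differentiation with respect to $\lambda$ yields

\[ (\hat{B} - \lambda I)\frac{\partial^2\mathbf{x}}{\partial\lambda^2} = 2\frac{\partial\mathbf{x}}{\partial\lambda}, \]
\[ \mathbf{x}'\frac{\partial^2\mathbf{x}}{\partial\lambda^2} + \frac{\partial\mathbf{x}'}{\partial\lambda}\frac{\partial\mathbf{x}}{\partial\lambda} = r\frac{\partial^2r}{\partial\lambda^2} + \left(\frac{\partial r}{\partial\lambda}\right)^2. \]

If we premultiply the second equation by $\partial^2\mathbf{x}'/\partial\lambda^2$ and the fourth
equation by $\partial\mathbf{x}'/\partial\lambda$, subtract, and transpose, we obtain

\[ \mathbf{x}'\frac{\partial^2\mathbf{x}}{\partial\lambda^2} - 2\frac{\partial\mathbf{x}'}{\partial\lambda}\frac{\partial\mathbf{x}}{\partial\lambda} = 0. \]

Substituting this in the fifth equation, we get

\[ r\frac{\partial^2r}{\partial\lambda^2} = 3\frac{\partial\mathbf{x}'}{\partial\lambda}\frac{\partial\mathbf{x}}{\partial\lambda} - \left(\frac{\partial r}{\partial\lambda}\right)^2. \]

Now, since

\[ \frac{\partial r}{\partial\lambda} = \frac{\partial(\mathbf{x}'\mathbf{x})^{1/2}}{\partial\lambda} = \frac{\mathbf{x}'\,\partial\mathbf{x}/\partial\lambda}{(\mathbf{x}'\mathbf{x})^{1/2}}, \]

we conclude that

\[ r^3\frac{\partial^2r}{\partial\lambda^2} = 3r^2\frac{\partial\mathbf{x}'}{\partial\lambda}\frac{\partial\mathbf{x}}{\partial\lambda} - \left(\mathbf{x}'\frac{\partial\mathbf{x}}{\partial\lambda}\right)^2. \]

(b) The expression in (a) can be written as

\[ r^3\frac{\partial^2r}{\partial\lambda^2} = 2r^2\frac{\partial\mathbf{x}'}{\partial\lambda}\frac{\partial\mathbf{x}}{\partial\lambda} + \left[r^2\frac{\partial\mathbf{x}'}{\partial\lambda}\frac{\partial\mathbf{x}}{\partial\lambda} - \left(\mathbf{x}'\frac{\partial\mathbf{x}}{\partial\lambda}\right)^2\right]. \]

The first part on the right-hand side is nonnegative and is zero only
when $r = 0$ or when $\partial\mathbf{x}/\partial\lambda = \mathbf{0}$. The second part is nonnegative by
the fact that

\[ \left(\mathbf{x}'\frac{\partial\mathbf{x}}{\partial\lambda}\right)^2 \le \|\mathbf{x}\|_2^2\left\|\frac{\partial\mathbf{x}}{\partial\lambda}\right\|_2^2 = r^2\frac{\partial\mathbf{x}'}{\partial\lambda}\frac{\partial\mathbf{x}}{\partial\lambda}. \]

Equality occurs only when $\mathbf{x} = \mathbf{0}$, that is, $r = 0$, or when $\partial\mathbf{x}/\partial\lambda = \mathbf{0}$.
But when $\partial\mathbf{x}/\partial\lambda = \mathbf{0}$, we have $\mathbf{x} = \mathbf{0}$ if $\lambda$ is different from all the
eigenvalues of $\hat{B}$, and thus $r = 0$. It follows that $\partial^2r/\partial\lambda^2 > 0$ except
when $r = 0$, where it takes the value zero.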


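8.7. (a)

\[ B = \frac{n\Omega}{\sigma^2}\int_R \{E[\hat{y}(\mathbf{x})] - \eta(\mathbf{x})\}^2\,d\mathbf{x}
 = \frac{n}{\sigma^2}\left\{(\boldsymbol{\gamma} - \boldsymbol{\beta})'\Gamma_{11}(\boldsymbol{\gamma} - \boldsymbol{\beta}) - 2(\boldsymbol{\gamma} - \boldsymbol{\beta})'\Gamma_{12}\boldsymbol{\delta} + \boldsymbol{\delta}'\Gamma_{22}\boldsymbol{\delta}\right\}, \]

where $\Gamma_{11}$, $\Gamma_{12}$, $\Gamma_{22}$ are the region moments defined in Section
8.4.3.

(b) To minimize $B$ we differentiate it with respect to $\boldsymbol{\gamma}$, equate the
derivative to zero, and solve for $\boldsymbol{\gamma}$. We get

\[ 2\Gamma_{11}(\boldsymbol{\gamma} - \boldsymbol{\beta}) - 2\Gamma_{12}\boldsymbol{\delta} = \mathbf{0}, \]
\[ \boldsymbol{\gamma} = \boldsymbol{\beta} + \Gamma_{11}^{-1}\Gamma_{12}\boldsymbol{\delta} = C\boldsymbol{\tau}. \]

This solution minimizes $B$, since $\Gamma_{11}$ is positive definite.

(c) $B$ is minimized if and only if $\boldsymbol{\gamma} = C\boldsymbol{\tau}$. This is equivalent to stating
that $C\boldsymbol{\tau}$ is estimable, since $E(\hat{\boldsymbol{\lambda}}) = \boldsymbol{\gamma}$.

(d) Writing $\hat{\boldsymbol{\lambda}}$ as a linear function of the vector $\mathbf{y}$ of observations of
the form $\hat{\boldsymbol{\lambda}} = L\mathbf{y}$, we obtain

\[ E(\hat{\boldsymbol{\lambda}}) = LE(\mathbf{y}) = L(X\boldsymbol{\beta} + Z\boldsymbol{\delta}) = L[X : Z]\boldsymbol{\tau}. \]

But $\boldsymbol{\gamma} = E(\hat{\boldsymbol{\lambda}}) = C\boldsymbol{\tau}$. We conclude that $C = L[X : Z]$.

(e) It is obvious from part (d) that the rows of $C$ are spanned by the
rows of $[X : Z]$.

(f) The matrix $L$ defined by $L = (X'X)^{-1}X'$ satisfies the equation

\[ L[X : Z] = C, \]

since

\[ L[X : Z] = (X'X)^{-1}X'[X : Z] = [I : (X'X)^{-1}X'Z] = [I : M_{11}^{-1}M_{12}] = [I : \Gamma_{11}^{-1}\Gamma_{12}] = C. \]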

8.8. If the region $R$ is a sphere of radius 1, then $3g^2 \le 1$. Now,

\[ \Gamma_{11} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \tfrac{1}{5} & 0 & 0 \\ 0 & 0 & \tfrac{1}{5} & 0 \\ 0 & 0 & 0 & \tfrac{1}{5} \end{bmatrix}, \qquad
   \Gamma_{12} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \]

Hence, $C = [I_4 : 0_{4\times 3}]$. Furthermore,

\[ X = [\mathbf{1}_4 : D]. \]

Hence,

\[ M_{11} = \tfrac{1}{4}X'X = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & g^2 & 0 & 0 \\ 0 & 0 & g^2 & 0 \\ 0 & 0 & 0 & g^2 \end{bmatrix}, \qquad
   M_{12} = \tfrac{1}{4}X'Z = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -g^3 \\ 0 & -g^3 & 0 \\ -g^3 & 0 & 0 \end{bmatrix}. \]

(a)

\[ M_{11} = \Gamma_{11} \ \Rightarrow\ g^2 = \tfrac{1}{5}, \]
\[ M_{12} = \Gamma_{12} \ \Rightarrow\ g = 0. \]

Thus, it is not possible to choose $g$ so that $D$ satisfies the conditions
in (8.56).

(b) Suppose that there exists a matrix $L$ of order $4 \times 4$ such that

\[ C = L[X : Z]. \]

Then

\[ I_4 = LX, \]
\[ 0 = LZ. \]

The second equation implies that $L$ is of rank 1, while the first
equation implies that the rank of $L$ is greater than or equal to the
rank of $I_4$, which is equal to 4. Therefore, it is not possible to find a
matrix such as $L$. Hence, $g$ cannot be chosen so that $D$ satisfies the
minimum bias property described in part (e) of Exercise 8.7.

8.9. (a) Since $A$ is symmetric, it can be written as $A = P\Lambda P'$, where $\Lambda$ is a
diagonal matrix of eigenvalues of $A$ and the columns of $P$ are the
corresponding orthogonal eigenvectors of $A$, each of length equal
to 1 (see Theorem 2.3.10). It is easy to see that over the region $\psi$,

\[ h(\boldsymbol{\delta}, D) \le \boldsymbol{\delta}'\boldsymbol{\delta}\,e_{\max}(A) \le r^2e_{\max}(A), \]

by the fact that $e_{\max}(A)I - P\Lambda P'$ is positive semidefinite. Without
loss of generality, we consider that the diagonal elements of $\Lambda$ are
written in descending order. The upper bound in the above inequality
is attained by $h(\boldsymbol{\delta}, D)$ for $\boldsymbol{\delta} = r\mathbf{P}_1$, where $\mathbf{P}_1$ is the first
column of $P$, which corresponds to $e_{\max}(A)$.

(b) The design $D$ can be chosen so that $e_{\max}(A)$ is minimized over the
region $R$.


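8.10. This is similar to Exercise 8.9. Write $T$ as $P\Lambda P'$, where $\Lambda$ is a diagonal
matrix of eigenvalues of $T$, and $P$ is an orthogonal matrix of corresponding
eigenvectors. Then $\boldsymbol{\delta}'T\boldsymbol{\delta} = \mathbf{u}'\mathbf{u}$, where $\mathbf{u} = \Lambda^{1/2}P'\boldsymbol{\delta}$, and
$h(\boldsymbol{\delta}, D) = \boldsymbol{\delta}'S\boldsymbol{\delta} = \mathbf{u}'\Lambda^{-1/2}P'SP\Lambda^{-1/2}\mathbf{u}$. Hence, over the region $\Phi$,

\[ h(\boldsymbol{\delta}, D) \ge \kappa\,e_{\min}(\Lambda^{-1/2}P'SP\Lambda^{-1/2}). \]

But, if $S$ is positive definite, then by Theorem 2.3.9,

\[ e_{\min}(\Lambda^{-1/2}P'SP\Lambda^{-1/2}) = e_{\min}(P\Lambda^{-1}P'S) = e_{\min}(T^{-1}S). \]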


8.11. (a) Using formula (8.52), it can be shown that (see Khuri and Cornell, 
1996, page 229) 


Ng 
Ay(k + 2) ~24(4,- *| 


V= 
Ay(k +2) —AZk dy 


Ay(k +1) — 03(k-1) 
Aa(k + 4) | 


where $k$ is the number of input variables (that is, $k = 2$),

\[ \lambda_2 = \frac{1}{n}\sum_{u=1}^n x_{ui}^2 = \frac{1}{n}(2^k + 2\alpha^2) = \frac{8}{n}, \qquad i = 1,2, \]
\[ \lambda_4 = \frac{1}{n}\sum_{u=1}^n x_{ui}^2x_{uj}^2 = \frac{2^k}{n} = \frac{4}{n}, \qquad i \ne j, \]

and $n = 2^k + 2k + n_0 = 8 + n_0$ is the total number of observations.
Here, $x_{ui}$ denotes the design setting of variable $x_i$, $i = 1,2$; $u = 1,2,\ldots,n$.

(b) The quantity $V$, being a function of $n_0$, can be minimized with
respect to $n_0$.

8.12. We have that

\[ (Y - XB)'(Y - XB) = (Y - X\hat{B} + X\hat{B} - XB)'(Y - X\hat{B} + X\hat{B} - XB)
 = (Y - X\hat{B})'(Y - X\hat{B}) + (X\hat{B} - XB)'(X\hat{B} - XB), \]

since

\[ (Y - X\hat{B})'(X\hat{B} - XB) = (X'Y - X'X\hat{B})'(\hat{B} - B) = 0. \]

Furthermore, since $(X\hat{B} - XB)'(X\hat{B} - XB)$ is positive semidefinite, then
by Theorem 2.3.19,

\[ e_i[(Y - XB)'(Y - XB)] \ge e_i[(Y - X\hat{B})'(Y - X\hat{B})], \qquad i = 1,2,\ldots,r, \]

where $e_i(\cdot)$ denotes the $i$th eigenvalue of a square matrix. If
$(Y - XB)'(Y - XB)$ and $(Y - X\hat{B})'(Y - X\hat{B})$ are nonsingular, then by multiplying
the eigenvalues on both sides of the inequality, we obtain

\[ \det[(Y - XB)'(Y - XB)] \ge \det[(Y - X\hat{B})'(Y - X\hat{B})]. \]

Equality holds when $B = \hat{B}$.

8.13. For any $b$, $1 + b \le e^b$. Let $a = 1 + b$. Then $a \le e^{a-1}$. Now, let $\lambda_i$ denote
the $i$th eigenvalue of $A$. Then $\lambda_i \ge 0$, and by the previous inequality,

\[ \prod_{i=1}^p \lambda_i \le \exp\left[\sum_{i=1}^p (\lambda_i - 1)\right]. \]

Hence,

\[ \det(A) \le \exp[\mathrm{tr}(A - I_p)]. \]

8.14. The likelihood function is proportional to

\[ [\det(V)]^{-n/2}\exp\left[-\frac{n}{2}\,\mathrm{tr}(SV^{-1})\right]. \]

Now, by Exercise 8.13,

\[ \det(SV^{-1}) \le \exp[\mathrm{tr}(SV^{-1} - I_p)], \]

since $\det(SV^{-1}) = \det(V^{-1/2}SV^{-1/2})$, $\mathrm{tr}(SV^{-1}) = \mathrm{tr}(V^{-1/2}SV^{-1/2})$, and
$V^{-1/2}SV^{-1/2}$ is positive semidefinite. Hence,

\[ [\det(SV^{-1})]^{n/2}\exp\left[-\frac{n}{2}\,\mathrm{tr}(SV^{-1})\right] \le \exp\left[-\frac{n}{2}\,\mathrm{tr}(I_p)\right]. \]

This results in the following inequality:

\[ [\det(V)]^{-n/2}\exp\left[-\frac{n}{2}\,\mathrm{tr}(SV^{-1})\right] \le [\det(S)]^{-n/2}\exp\left[-\frac{n}{2}\,\mathrm{tr}(I_p)\right], \]

which is the desired result.
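The inequality $\det(A) \le \exp[\mathrm{tr}(A - I_p)]$ from Exercise 8.13, which drives the argument in 8.14, can be spot-checked for small matrices. The sketch below is restricted to symmetric $2 \times 2$ matrices with entries $a$, $b$, $c$; this restriction and the function name are assumptions made only to keep the example dependency-free.

```python
import math


def det_and_bound(a, b, c):
    """For A = [[a, b], [b, c]] return (det(A), exp(tr(A - I_2)))."""
    det = a * c - b * b
    bound = math.exp(a + c - 2.0)
    return det, bound


# Each triple (a, b, c) below defines a positive semidefinite matrix.
examples = [(1.0, 0.0, 1.0), (2.0, 0.5, 1.0), (0.3, 0.1, 0.2), (4.0, 1.0, 3.0)]
for a, b, c in examples:
    det, bound = det_and_bound(a, b, c)
    print((a, b, c), det, bound, det <= bound)
```

The identity matrix gives equality, which corresponds to $V = S$ in the likelihood argument.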


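8.16. $\hat{y}(\mathbf{x}) = \mathbf{f}'(\mathbf{x})\hat{\boldsymbol{\beta}}$, where $\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{y}$, and $\mathbf{f}'(\mathbf{x})$ is as in model (8.47).
Simultaneous $(1 - \alpha) \times 100\%$ confidence intervals on $\mathbf{f}'(\mathbf{x})\boldsymbol{\beta}$ for all $\mathbf{x}$ in
$R$ are of the form

\[ \mathbf{f}'(\mathbf{x})\hat{\boldsymbol{\beta}} \mp (p\,MS_E\,F_{\alpha,p,n-p})^{1/2}\,[\mathbf{f}'(\mathbf{x})(X'X)^{-1}\mathbf{f}(\mathbf{x})]^{1/2}. \]

For the points $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$, the joint confidence coefficient is at least
$1 - \alpha$.

CHAPTER 9

9.1. If $f(x)$ has a continuous derivative on $[0,1]$, then by Theorems 3.4.5
and 4.2.2 we can find a positive constant $A$ such that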
\[ |f(x_1) - f(x_2)| \le A|x_1 - x_2| \]

for all $x_1$, $x_2$ in $[0,1]$. Thus, by Definition 9.1.2,

\[ \omega(\delta) \le A\delta. \]

Using now Theorem 9.1.3, we obtain

\[ |f(x) - b_n(x)| \le \frac{3}{2}\,\frac{A}{n^{1/2}} \]

for all $x$ in $[0,1]$. Hence,

\[ \sup_{0 \le x \le 1}|f(x) - b_n(x)| \le \frac{c}{n^{1/2}}, \]

where $c = \tfrac{3}{2}A$.

9.2. We have that

\[ |f(x_1) - f(x_2)| \le \sup_{|z_1 - z_2| \le \delta_2}|f(z_1) - f(z_2)| \]

for all $|x_1 - x_2| \le \delta_2$, and hence for all $|x_1 - x_2| \le \delta_1$, since $\delta_1 \le \delta_2$. It
follows that

\[ \sup_{|x_1 - x_2| \le \delta_1}|f(x_1) - f(x_2)| \le \sup_{|z_1 - z_2| \le \delta_2}|f(z_1) - f(z_2)|, \]

that is, $\omega(\delta_1) \le \omega(\delta_2)$.

9.3. Suppose that $f(x)$ is uniformly continuous on $[a,b]$. Then, for a given
$\epsilon > 0$, there exists a positive $\delta(\epsilon)$ such that $|f(x_1) - f(x_2)| < \epsilon$ for all
$x_1$, $x_2$ in $[a,b]$ for which $|x_1 - x_2| < \delta$. This implies that $\omega(\delta) \le \epsilon$ and
hence $\omega(\delta) \to 0$ as $\delta \to 0$. Vice versa, if $\omega(\delta) \to 0$ as $\delta \to 0$, then for a
given $\epsilon > 0$, there exists $\delta_1 > 0$ such that $\omega(\delta) < \epsilon$ if $\delta < \delta_1$. This
implies that

\[ |f(x_1) - f(x_2)| < \epsilon \quad \text{if } |x_1 - x_2| \le \delta < \delta_1, \]

and $f(x)$ must therefore be uniformly continuous on $[a,b]$.

9.4. By Theorem 9.1.1, there exists a sequence of polynomials, namely the
Bernstein polynomials $\{b_n(x)\}_{n=1}^{\infty}$, that converges to $f(x) = |x|$ uniformly
on $[-a, a]$. Let $p_n(x) = b_n(x) - b_n(0)$. Then $p_n(0) = 0$, and $p_n(x)$
converges uniformly to $|x|$ on $[-a, a]$, since $b_n(0) \to 0$ as $n \to \infty$.

9.5. The stated condition implies that $\int_0^1 f(x)p_n(x)\,dx = 0$ for any polynomial
$p_n(x)$ of degree $n$. In particular, if we choose $p_n(x)$ to be a Bernstein
polynomial for $f(x)$, then it will converge uniformly to $f(x)$ on $[0,1]$. By
Theorem 6.6.1,

\[ \int_0^1 [f(x)]^2\,dx = \lim_{n\to\infty}\int_0^1 f(x)p_n(x)\,dx = 0. \]

Since $f(x)$ is continuous, it must be zero everywhere on $[0,1]$. If not,
then by Theorem 3.4.3, there exists a neighborhood of a point in $[0,1]$
[at which $f(x) \ne 0$] on which $f(x) \ne 0$. This causes $\int_0^1 [f(x)]^2\,dx$ to be
positive, a contradiction.

9.6. Using formula (9.15), we have


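\[ f(x) - p(x) = \frac{1}{(n+1)!}\,f^{(n+1)}(c)\prod_{i=0}^n (x - a_i). \]

But $|\prod_{i=0}^n (x - a_i)| \le n!\,(h^{n+1}/4)$ (see Prenter, 1975, page 37). Hence,

\[ \sup_{a_0 \le x \le a_n}|f(x) - p(x)| \le \frac{\tau_{n+1}h^{n+1}}{4(n+1)}. \]

9.7. Using formula (9.14) with $a_0$, $a_1$, $a_2$, and $a_3$, we obtain

\[ p(x) = \ell_0(x)\log a_0 + \ell_1(x)\log a_1 + \ell_2(x)\log a_2 + \ell_3(x)\log a_3, \]

where

\[ \ell_0(x) = \frac{(x - a_1)(x - a_2)(x - a_3)}{(a_0 - a_1)(a_0 - a_2)(a_0 - a_3)}, \qquad
   \ell_1(x) = \frac{(x - a_0)(x - a_2)(x - a_3)}{(a_1 - a_0)(a_1 - a_2)(a_1 - a_3)}, \]
\[ \ell_2(x) = \frac{(x - a_0)(x - a_1)(x - a_3)}{(a_2 - a_0)(a_2 - a_1)(a_2 - a_3)}, \qquad
   \ell_3(x) = \frac{(x - a_0)(x - a_1)(x - a_2)}{(a_3 - a_0)(a_3 - a_1)(a_3 - a_2)}. \]

Values of $f(x) = \log x$ and the corresponding values of $p(x)$ at several
points inside the interval $[3.50, 3.80]$ are given below:

x      f(x)         p(x)
3.50   1.25276297   1.25276297
3.52   1.25846099   1.25846087
3.56   1.26976054   1.26976043
3.60   1.28093385   1.28093385
3.62   1.28647403   1.28647407
3.66   1.29746315   1.29746322
3.70   1.30833282   1.30833282
3.72   1.31372367   1.31372361
3.77   1.32707500   1.32707487
3.80   1.33500107   1.33500107

Using the result of Exercise 9.6, an upper bound on the error of
approximation is given by

\[ \sup_{3.5 \le x \le 3.8}|f(x) - p(x)| \le \frac{\tau_4h^4}{16}, \]

where $h = \max_i(a_{i+1} - a_i) = 0.10$, and

\[ \tau_4 = \sup_{3.5 \le x \le 3.8}|f^{(4)}(x)| = \sup_{3.5 \le x \le 3.8}\left|\frac{-6}{x^4}\right| = \frac{6}{(3.5)^4}. \]

Hence, the desired upper bound is

\[ \frac{\tau_4h^4}{16} = \frac{6}{16}\left(\frac{0.10}{3.5}\right)^4 = 2.5 \times 10^{-7}. \]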

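9.8. We have that

\[ \int_a^b [f''(x) - s''(x)]^2\,dx = \int_a^b [f''(x)]^2\,dx - \int_a^b [s''(x)]^2\,dx - 2\int_a^b s''(x)[f''(x) - s''(x)]\,dx. \]

But integration by parts yields

\[ \int_a^b s''(x)[f''(x) - s''(x)]\,dx = s''(x)[f'(x) - s'(x)]\Big|_a^b - \int_a^b s'''(x)[f'(x) - s'(x)]\,dx. \]

The first term on the right-hand side is zero, since $f'(x) = s'(x)$ at
$x = a, b$; and the second term is also zero, by the fact that $s'''(x)$ is a
constant, say $s'''(x) = \lambda_i$, over $(\tau_i, \tau_{i+1})$. Hence,

\[ \int_a^b s'''(x)[f'(x) - s'(x)]\,dx = \sum_i \lambda_i\int_{\tau_i}^{\tau_{i+1}} [f'(x) - s'(x)]\,dx = 0. \]

It follows that

\[ \int_a^b [f''(x) - s''(x)]^2\,dx = \int_a^b [f''(x)]^2\,dx - \int_a^b [s''(x)]^2\,dx, \]

which implies the desired result.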


\[ \frac{\partial^{r+1}h(x, \boldsymbol{\theta})}{\partial x^{r+1}} = (-1)^r\theta_1\theta_2\theta_3^{r+1}e^{-\theta_3x}. \]


Hence, 


sup 
O<x<8 


a” h(x, os 
et 
Using inequality (9.36), the integer r is determined such that 


copia) @ < 0.05. 


The smallest integer that satisfies this inequality is $r = 10$. The Chebyshev
points corresponding to this value of $r$ are $z_0, z_1, \ldots, z_{10}$, where by
formula (9.18),

\[ z_i = 4 + 4\cos\left[\frac{(2i + 1)\pi}{22}\right], \qquad i = 0,1,\ldots,10. \]

Using formula (9.37), the Lagrange interpolating polynomial that approximates
$h(x, \boldsymbol{\theta})$ over $[0,8]$ is given by

\[ p(x, \boldsymbol{\theta}) = \theta_1\sum_{i=0}^{10}\left[1 - \theta_2e^{-\theta_3z_i}\right]\ell_i(x), \]

where $\ell_i(x)$ is a polynomial of degree 10 which can be obtained from
(9.13) by substituting $z_i$ for $a_i$ $(i = 0,1,\ldots,10)$.


. We have that 


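\[ \frac{\partial^4\eta(x, \alpha, \beta)}{\partial x^4} = (0.49 - \alpha)\beta^4e^{-\beta(x-8)}. \]

Hence,

\[ \sup_{10 \le x \le 40}\left|\frac{\partial^4\eta(x, \alpha, \beta)}{\partial x^4}\right| = (0.49 - \alpha)\beta^4e^{-2\beta}. \]

It can be verified that the function $f(\beta) = \beta^4e^{-2\beta}$ is strictly monotone
increasing for $0.06 \le \beta \le 0.16$. Therefore, $\beta = 0.16$ maximizes $f(\beta)$
over this interval. Hence,

\[ \sup_{10 \le x \le 40}\left|\frac{\partial^4\eta(x, \alpha, \beta)}{\partial x^4}\right| \le (0.49 - \alpha)(0.16)^4e^{-0.32} \le 0.13(0.16)^4e^{-0.32} = 0.0000619, \]

since $0.36 \le \alpha \le 0.41$. Using Theorem 9.3.1, we have

\[ \sup_{10 \le x \le 40}|\eta(x, \alpha, \beta) - s(x)| \le \frac{5}{384}\,\Delta^4(0.0000619). \]

Here, we considered equally spaced partition points with $\Delta\tau_i = \Delta$. Let
us now choose $\Delta$ such that

\[ \frac{5}{384}\,\Delta^4(0.0000619) < 0.001. \]

This is satisfied if $\Delta < 5.93$. Choosing $\Delta = 5$, the number of knots
needed is

\[ m = \frac{40 - 10}{\Delta} - 1 = 5. \]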
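The arithmetic behind the choice of $\Delta$ can be reproduced as follows. This is a small sketch of the calculation only; the constant $5/384$ is the one appearing in the bound above, and the function names are illustrative assumptions.

```python
import math


def fourth_derivative_bound(alpha_min=0.36, beta_max=0.16):
    """Upper bound (0.49 - alpha) beta^4 exp(-2 beta) at the extreme parameter values."""
    return (0.49 - alpha_min) * beta_max ** 4 * math.exp(-2.0 * beta_max)


def max_spacing(tolerance, deriv_bound):
    """Largest Delta with (5/384) * Delta^4 * deriv_bound below the tolerance."""
    return (tolerance * 384.0 / (5.0 * deriv_bound)) ** 0.25


bound = fourth_derivative_bound()
delta_max = max_spacing(0.001, bound)
interior_knots = int((40 - 10) / 5) - 1   # Delta = 5 on the interval [10, 40]
print(bound, delta_max, interior_knots)
```

Using the unrounded derivative bound gives a threshold marginally above the 5.93 quoted in the text, which rests on the rounded value 0.0000619.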
9.12. 
i 2 2)? 
det(X'X) = [(x; - @)"( =x1) = 25-1) O2— @)°] . 
The determinant is maximized when x, = —1, x, = į(1 + a}, x,=1. 
CHAPTER 10 


10.1. We have that 
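\[ \frac{1}{\pi}\int_{-\pi}^{\pi} \cos nx\,\cos mx\,dx = \frac{1}{2\pi}\int_{-\pi}^{\pi} [\cos(n + m)x + \cos(n - m)x]\,dx
 = \begin{cases} 0, & n \ne m, \\ 1, & n = m \ge 1, \end{cases} \]

\[ \frac{1}{\pi}\int_{-\pi}^{\pi} \cos nx\,\sin mx\,dx = \frac{1}{2\pi}\int_{-\pi}^{\pi} [\sin(n + m)x - \sin(n - m)x]\,dx = 0 \quad \text{for all } m, n, \]

\[ \frac{1}{\pi}\int_{-\pi}^{\pi} \sin nx\,\sin mx\,dx = \frac{1}{2\pi}\int_{-\pi}^{\pi} [\cos(n - m)x - \cos(n + m)x]\,dx
 = \begin{cases} 0, & n \ne m, \\ 1, & n = m \ge 1. \end{cases} \]

10.2. (a)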
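(i) Integrating $n$ times by parts [$x^m$ is differentiated and $p_n(x)$ is
integrated], we obtain

\[ \int_{-1}^1 x^mp_n(x)\,dx = 0 \quad \text{for } m = 0,1,\ldots,n-1. \]

(ii) Integrating $n$ times by parts results in

\[ \int_{-1}^1 x^np_n(x)\,dx = \frac{1}{2^n}\int_{-1}^1 (1 - x^2)^n\,dx. \]

Letting $x = \cos\theta$, we obtain

\[ \int_{-1}^1 (1 - x^2)^n\,dx = \int_0^{\pi} \sin^{2n+1}\theta\,d\theta = \frac{2^{2n+1}}{2n+1}\binom{2n}{n}^{-1}, \qquad n \ge 0. \]

(b) This is obvious, since (a) is true for $m = 0,1,\ldots,n-1$.

(c)

\[ \int_{-1}^1 p_n^2(x)\,dx = \int_{-1}^1 \left[\frac{1}{2^n}\binom{2n}{n}x^n + \pi_{n-1}(x)\right]p_n(x)\,dx, \]

where $\pi_{n-1}(x)$ denotes a polynomial of degree $n - 1$. Hence,
using the results in (a) and (b), we obtain

\[ \int_{-1}^1 p_n^2(x)\,dx = \frac{1}{2^n}\binom{2n}{n}\cdot\frac{1}{2^n}\cdot\frac{2^{2n+1}}{2n+1}\binom{2n}{n}^{-1} = \frac{2}{2n+1}, \qquad n \ge 0. \]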
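The results of 10.2 can be checked exactly with rational arithmetic. The sketch below builds the Legendre polynomials from the standard three-term recurrence $(k+1)p_{k+1}(x) = (2k+1)x\,p_k(x) - k\,p_{k-1}(x)$, which is an assumption of this example (the solution itself works from the Rodrigues formula); the function names are hypothetical.

```python
from fractions import Fraction


def legendre(n):
    """Coefficients of p_n, lowest degree first, from the three-term recurrence."""
    prev, cur = [Fraction(1)], [Fraction(0), Fraction(1)]
    if n == 0:
        return prev
    for k in range(1, n):
        shifted = [Fraction(0)] + cur                       # coefficients of x * p_k
        nxt = [Fraction(2 * k + 1, k + 1) * c for c in shifted]
        for idx, c in enumerate(prev):
            nxt[idx] -= Fraction(k, k + 1) * c              # subtract k/(k+1) * p_{k-1}
        prev, cur = cur, nxt
    return cur


def multiply(p, q):
    """Product of two polynomials given by coefficient lists."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def integrate(p):
    """Exact integral over [-1, 1]; odd powers contribute nothing."""
    return sum(Fraction(2, k + 1) * c for k, c in enumerate(p) if k % 2 == 0)


def inner(m, n):
    """Integral of p_m(x) p_n(x) over [-1, 1]."""
    return integrate(multiply(legendre(m), legendre(n)))


for n in range(5):
    print(n, legendre(n), inner(n, n))
```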
10.3. (a)

\[ T_n(\zeta_i) = \cos\left\{n\,\mathrm{Arccos}\left[\cos\frac{(2i-1)\pi}{2n}\right]\right\} = \cos\frac{(2i-1)\pi}{2} = 0 \]

for $i = 1,2,\ldots,n$.

(b) $T_n'(x) = (n/\sqrt{1 - x^2})\sin(n\,\mathrm{Arccos}\,x)$. Hence,

\[ T_n'(\zeta_i) = \frac{n}{\sqrt{1 - \zeta_i^2}}\,\sin\frac{(2i-1)\pi}{2} \ne 0 \]

for $i = 1,2,\ldots,n$.

10.4. (a)


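Using formulas (10.21) and (10.24), we have

\[ \frac{dH_n(x)}{dx} = (-1)^nx\,e^{x^2/2}\,\frac{d^n(e^{-x^2/2})}{dx^n} + (-1)^ne^{x^2/2}\,\frac{d^{n+1}(e^{-x^2/2})}{dx^{n+1}} \]
\[ \phantom{\frac{dH_n(x)}{dx}} = (-1)^nx\,e^{x^2/2}(-1)^ne^{-x^2/2}H_n(x) + (-1)^ne^{x^2/2}(-1)^{n+1}e^{-x^2/2}H_{n+1}(x) \]
\[ \phantom{\frac{dH_n(x)}{dx}} = xH_n(x) - H_{n+1}(x) = nH_{n-1}(x), \quad \text{by formula (10.25)}. \]

(b) From (10.23) and (10.24), we have

\[ H_{n+1}(x) = xH_n(x) - \frac{dH_n(x)}{dx}. \]

Hence,

\[ \frac{dH_{n+1}(x)}{dx} = x\,\frac{dH_n(x)}{dx} + H_n(x) - \frac{d^2H_n(x)}{dx^2}, \]
\[ \frac{d^2H_n(x)}{dx^2} = x\,\frac{dH_n(x)}{dx} + H_n(x) - \frac{dH_{n+1}(x)}{dx} = x\,\frac{dH_n(x)}{dx} + H_n(x) - (n+1)H_n(x) = x\,\frac{dH_n(x)}{dx} - nH_n(x), \]

which gives the desired result.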
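10.5. (a) Using formula (10.18), we can show that
\[
\left|\frac{\sin[(n+1)\theta]}{\sin\theta}\right| \le n + 1
\]
by mathematical induction. Obviously, the inequality is true for $n = 1$, since
\[
\left|\frac{\sin 2\theta}{\sin\theta}\right| = \left|\frac{2\sin\theta\cos\theta}{\sin\theta}\right| \le 2.
\]
Suppose now that the inequality is true for $n = m$. To show that it is true for $n = m + 1$ ($m \ge 1$):
\[
\left|\frac{\sin[(m+2)\theta]}{\sin\theta}\right|
= \left|\frac{\sin[(m+1)\theta]\cos\theta + \cos[(m+1)\theta]\sin\theta}{\sin\theta}\right|
\le m + 1 + 1 = m + 2.
\]
Therefore, the inequality is true for all $n$.
(b) From Section 10.4.2 we have that
\[
\frac{dT_n(x)}{dx} = n\,\frac{\sin n\theta}{\sin\theta}.
\]
Hence,
\[
\left|\frac{dT_n(x)}{dx}\right| = n\left|\frac{\sin n\theta}{\sin\theta}\right| \le n^2,
\]
since $|\sin n\theta/\sin\theta| \le n$, which can be proved by induction as in (a). Note that as $x \to \pm 1$, that is, as $\theta \to 0$ or $\theta \to \pi$,
\[
\left|\frac{dT_n(x)}{dx}\right| \to n^2.
\]
(c) Making the change of variable $t = \cos\theta$, we get
\[
\begin{aligned}
\int_{-1}^{x} \frac{T_n(t)}{\sqrt{1 - t^2}}\,dt
&= -\int_{\pi}^{\operatorname{Arccos} x} \cos n\theta\,d\theta \\
&= -\frac{1}{n}\sin n\theta\,\Big|_{\pi}^{\operatorname{Arccos} x} \\
&= -\frac{1}{n}\sin(n \operatorname{Arccos} x) \\
&= -\frac{1}{n}\sin n\psi, \quad \text{where } x = \cos\psi, \\
&= -\frac{1}{n}\sin\psi\,U_{n-1}(x), \quad \text{by (10.18)}, \\
&= -\frac{1}{n}\sqrt{1 - x^2}\,U_{n-1}(x).
\end{aligned}
\]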
10.7. The first two Laguerre polynomials are $L_0(x) = 1$ and $L_1(x) = x - \alpha - 1$, as can be seen from applying the Rodrigues formula in Section 10.6. Now, differentiating $H(x,t)$ with respect to $t$, we obtain
\[
\frac{\partial H(x,t)}{\partial t} = \left[\frac{\alpha + 1}{1 - t} - \frac{x}{(1 - t)^2}\right] H(x,t),
\]
or equivalently,
\[
(1 - t)^2\,\frac{\partial H(x,t)}{\partial t} + [(x - \alpha - 1) + (\alpha + 1)t]\,H(x,t) = 0.
\]
Hence, $g_0(x) = H(x,0) = 1$, and $g_1(x) = -\partial H(x,t)/\partial t\,|_{t=0} = x - \alpha - 1$. Thus, $g_0(x) = L_0(x)$ and $g_1(x) = L_1(x)$. Furthermore, if the representation
\[
H(x,t) = \sum_{n=0}^{\infty} \frac{(-1)^n}{n!}\,g_n(x)\,t^n
\]
is substituted in the above equation, and if the coefficient of $t^n$ in the resulting series is equated to zero, we obtain
\[
g_{n+1}(x) + (2n - x + \alpha + 1)\,g_n(x) + (n^2 + n\alpha)\,g_{n-1}(x) = 0.
\]
This is the same relation connecting the Laguerre polynomials $L_{n+1}(x)$, $L_n(x)$, and $L_{n-1}(x)$, which is given at the end of Section 10.6. Since we have already established that $g_n(x) = L_n(x)$ for $n = 0, 1$, we conclude that the same relation holds for all values of $n$.


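10.8. Using formula (10.40), we have
\[
p_4^*(x) = c_0 + c_1 x + c_2\left(\frac{3x^2}{2} - \frac{1}{2}\right)
+ c_3\left(\frac{5x^3}{2} - \frac{3x}{2}\right)
+ c_4\left(\frac{35x^4}{8} - \frac{30x^2}{8} + \frac{3}{8}\right),
\]
where
\[
\begin{aligned}
c_0 &= \frac{1}{2}\int_{-1}^{1} e^{x}\,dx, \\
c_1 &= \frac{3}{2}\int_{-1}^{1} x e^{x}\,dx, \\
c_2 &= \frac{5}{2}\int_{-1}^{1} \left(\frac{3x^2}{2} - \frac{1}{2}\right) e^{x}\,dx, \\
c_3 &= \frac{7}{2}\int_{-1}^{1} \left(\frac{5x^3}{2} - \frac{3x}{2}\right) e^{x}\,dx, \\
c_4 &= \frac{9}{2}\int_{-1}^{1} \left(\frac{35x^4}{8} - \frac{30x^2}{8} + \frac{3}{8}\right) e^{x}\,dx.
\end{aligned}
\]
In computing $c_0, c_1, c_2, c_3, c_4$ we have made use of the fact that
\[
\int_{-1}^{1} p_n^2(x)\,dx = \frac{2}{2n+1}, \quad n = 0, 1, 2, \ldots,
\]
where $p_n(x)$ is the Legendre polynomial of degree $n$ (see Section 10.2.1).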


10.9.
\[
\begin{aligned}
f(x) &\approx \frac{c_0}{2} + c_1T_1(x) + c_2T_2(x) + c_3T_3(x) + c_4T_4(x) \\
&= 1.266066 + 1.130318\,T_1(x) + 0.271495\,T_2(x) + 0.044337\,T_3(x) + 0.005474\,T_4(x) \\
&= 1.000044 + 0.99731x + 0.4992x^2 + 0.177344x^3 + 0.043792x^4.
\end{aligned}
\]
[Note: For more computational details, see Example 7.2 in Ralston and Rabinowitz (1978).]
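The coefficients above can be reproduced numerically. The sketch below assumes that the function being approximated is $f(x) = e^x$ on $[-1, 1]$ (the quoted coefficients match its Chebyshev expansion) and uses Gauss-Chebyshev quadrature for $c_k = (2/\pi)\int_{-1}^{1} f(x)T_k(x)(1 - x^2)^{-1/2}\,dx$. The function name is hypothetical.

```python
import math

def chebyshev_coeffs(f, degree, nodes=64):
    """Chebyshev coefficients c_0..c_degree of f by Gauss-Chebyshev quadrature."""
    thetas = [(2 * j - 1) * math.pi / (2 * nodes) for j in range(1, nodes + 1)]
    return [2.0 / nodes * sum(f(math.cos(t)) * math.cos(k * t) for t in thetas)
            for k in range(degree + 1)]

c = chebyshev_coeffs(math.exp, 4)   # assumption: f(x) = exp(x)

# Convert to powers of x using T_2 = 2x^2 - 1, T_3 = 4x^3 - 3x, T_4 = 8x^4 - 8x^2 + 1.
monomial = [
    c[0] / 2 - c[2] + c[4],   # constant term
    c[1] - 3 * c[3],          # x
    2 * c[2] - 8 * c[4],      # x^2
    4 * c[3],                 # x^3
    8 * c[4],                 # x^4
]
print([round(v, 6) for v in [c[0] / 2] + c[1:]])
print([round(v, 6) for v in monomial])
```

The last digits of the monomial coefficients can differ slightly from the printed ones, because the printed values were obtained from coefficients already rounded to six decimals.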


10.10. Use formula (10.48) with the given values of the central moments. 


10.11. The first six cumulants of the standardized chi-squared distribution with five degrees of freedom are $\kappa_1 = 0$, $\kappa_2 = 1$, $\kappa_3 = 1.264911064$, $\kappa_4 = 2.40$, $\kappa_5 = 6.0715731$, $\kappa_6 = 19.2$. We also have that $z_{0.05} = 1.645$. Applying the Cornish-Fisher approximation for $x^*_{0.05}$, we obtain the value $x^*_{0.05} \approx 1.921$. Thus, $P(\chi_5^{*2} > 1.921) \approx 0.05$, where $\chi_5^{*2}$ denotes the standardized chi-squared variate with five degrees of freedom. If $\chi_5^2$ denotes the nonstandardized chi-squared counterpart (with five degrees of freedom), then $\chi_5^2 = \sqrt{10}\,\chi_5^{*2} + 5$. Hence, the corresponding approximate value of the upper 0.05 quantile of $\chi_5^2$ is $\sqrt{10}\,(1.921) + 5 = 11.0747$. The actual table value is 11.07.
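The quoted cumulants follow from the standard fact that the $r$th cumulant of a chi-squared variate with $\nu$ degrees of freedom is $2^{r-1}(r-1)!\,\nu$, rescaled by $(2\nu)^{r/2}$ after standardization. A minimal sketch, with a hypothetical function name, that reproduces them and the final back-transformation (the Cornish-Fisher step itself is not re-derived here):

```python
import math

def standardized_chi2_cumulant(r, df):
    """r-th cumulant (r >= 2) of (X - df) / sqrt(2 * df), X chi-squared with df d.f."""
    raw = 2 ** (r - 1) * math.factorial(r - 1) * df   # cumulant of X itself
    return raw / (2 * df) ** (r / 2)

cumulants = [standardized_chi2_cumulant(r, 5) for r in range(2, 7)]
print(cumulants)

# Back-transform the standardized quantile 1.921 quoted in the solution.
quantile = math.sqrt(10) * 1.921 + 5
print(quantile)
```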


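10.12. (a)
\[
\int_{0}^{1} e^{-t^2/2}\,dt \approx 1 - \frac{1}{2 \times 3 \times 1!} + \frac{1}{2^2 \times 5 \times 2!}
- \frac{1}{2^3 \times 7 \times 3!} + \frac{1}{2^4 \times 9 \times 4!}
= 0.85564649.
\]
(b)
\[
\int_{0}^{x} e^{-t^2/2}\,dt \approx x e^{-x^2/8}\left[\Theta_0\!\left(\frac{x}{2}\right)
+ \frac{1}{3}\Theta_2\!\left(\frac{x}{2}\right)
+ \frac{1}{5}\Theta_4\!\left(\frac{x}{2}\right)
+ \frac{1}{7}\Theta_6\!\left(\frac{x}{2}\right)\right],
\]
where
\[
\begin{aligned}
\Theta_0\!\left(\frac{x}{2}\right) &= 1, \\
\Theta_2\!\left(\frac{x}{2}\right) &= \frac{1}{2!}\left(\frac{x}{2}\right)^2 H_2\!\left(\frac{x}{2}\right)
= \frac{1}{2!}\left(\frac{x}{2}\right)^2\left[\left(\frac{x}{2}\right)^2 - 1\right], \\
\Theta_4\!\left(\frac{x}{2}\right) &= \frac{1}{4!}\left(\frac{x}{2}\right)^4 H_4\!\left(\frac{x}{2}\right), \\
\Theta_6\!\left(\frac{x}{2}\right) &= \frac{1}{6!}\left(\frac{x}{2}\right)^6 H_6\!\left(\frac{x}{2}\right).
\end{aligned}
\]
Hence,
\[
\int_{0}^{1} e^{-t^2/2}\,dt \approx 0.85562427.
\]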
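Both approximations can be compared with the exact value $\sqrt{\pi/2}\,\operatorname{erf}(1/\sqrt{2})$. The sketch below assumes the reading $\Theta_n(x) = x^n H_n(x)/n!$ for the functions in part (b), with $H_n$ generated from the recurrence $H_{k+1}(x) = xH_k(x) - kH_{k-1}(x)$; the helper names are hypothetical.

```python
import math

def hermite(n, x):
    """H_n(x) from H_{k+1} = x H_k - k H_{k-1}, with H_0 = 1 and H_1 = x."""
    h_prev, h_cur = 1.0, x
    if n == 0:
        return h_prev
    for k in range(1, n):
        h_prev, h_cur = h_cur, x * h_cur - k * h_prev
    return h_cur

def theta(n, x):
    """Assumed definition: Theta_n(x) = x^n H_n(x) / n!."""
    return x ** n * hermite(n, x) / math.factorial(n)

# (a) first five terms of the Maclaurin series integrated over [0, 1]
series = sum((-1) ** k / (2 ** k * (2 * k + 1) * math.factorial(k)) for k in range(5))

# (b) four-term expansion evaluated at x = 1
x = 1.0
expansion = x * math.exp(-x * x / 8) * sum(theta(2 * j, x / 2) / (2 * j + 1) for j in range(4))

exact = math.sqrt(math.pi / 2) * math.erf(1 / math.sqrt(2))
print(series, expansion, exact)
```

Under these assumptions the four-term expansion in (b) lands closer to the exact value than the five-term power series in (a).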


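10.13. Using formula (10.21), we have that
\[
\begin{aligned}
f(x) &= \sum_{n=0}^{\infty} b_n\,\phi(x)\,H_n(x), \quad \text{by (10.44)}, \\
&= \sum_{n=0}^{\infty} (-1)^n b_n\,\frac{d^n\phi(x)}{dx^n} \\
&= \sum_{n=0}^{\infty} \frac{c_n}{n!}\,\frac{d^n\phi(x)}{dx^n},
\end{aligned}
\]
where $c_n = (-1)^n n!\,b_n$, $n = 0, 1, \ldots$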
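10.14. The moment generating function of any one of the $X_i^2$'s is
\[
\begin{aligned}
\psi(t) &= \int_{-\infty}^{\infty} e^{tx^2} f(x)\,dx \\
&= \int_{-\infty}^{\infty} e^{tx^2}\left[\phi(x) - \frac{\lambda_3}{6}\frac{d^3\phi(x)}{dx^3}
+ \frac{\lambda_4}{24}\frac{d^4\phi(x)}{dx^4} + \frac{\lambda_3^2}{72}\frac{d^6\phi(x)}{dx^6}\right] dx.
\end{aligned}
\]
By formula (10.21),
\[
\frac{d^n\phi(x)}{dx^n} = (-1)^n\phi(x)H_n(x), \quad n = 0, 1, 2, \ldots
\]
Hence,
\[
\begin{aligned}
\psi(t) &= \int_{-\infty}^{\infty} e^{tx^2}\left[\phi(x) + \frac{\lambda_3}{6}\phi(x)H_3(x)
+ \frac{\lambda_4}{24}\phi(x)H_4(x) + \frac{\lambda_3^2}{72}\phi(x)H_6(x)\right] dx \\
&= 2\int_{0}^{\infty} e^{tx^2}\left[\phi(x) + \frac{\lambda_4}{24}\phi(x)H_4(x)
+ \frac{\lambda_3^2}{72}\phi(x)H_6(x)\right] dx,
\end{aligned}
\]
since $H_3(x)$ is an odd function, and $H_4(x)$ and $H_6(x)$ are even functions. It is known that
\[
\int_{0}^{\infty} e^{-x^2/2}x^{2n-1}\,dx = 2^{n-1}\Gamma(n),
\]
where $\Gamma(n)$ is the gamma function
\[
\Gamma(n) = \int_{0}^{\infty} e^{-x}x^{n-1}\,dx = 2\int_{0}^{\infty} e^{-y^2}y^{2n-1}\,dy,
\]
$n > 0$. It is easy to show that
\[
\int_{0}^{\infty} e^{tx^2}\phi(x)\,x^m\,dx
= \frac{1}{\sqrt{2\pi}}\int_{0}^{\infty} e^{-(1/2)(1-2t)x^2}x^m\,dx
= \frac{2^{(m-1)/2}}{\sqrt{2\pi}}\,(1 - 2t)^{-(m+1)/2}\,\Gamma\!\left(\frac{m+1}{2}\right),
\]
where $m$ is an even integer. Hence,
\[
2\int_{0}^{\infty} e^{tx^2}\phi(x)\,dx = \frac{1}{\sqrt{\pi}}\,(1 - 2t)^{-1/2}\,\Gamma\!\left(\tfrac{1}{2}\right)
= (1 - 2t)^{-1/2},
\]
\[
\begin{aligned}
2\int_{0}^{\infty} e^{tx^2}\phi(x)H_4(x)\,dx
&= 2\int_{0}^{\infty} e^{tx^2}\phi(x)(x^4 - 6x^2 + 3)\,dx \\
&= \frac{4}{\sqrt{\pi}}(1 - 2t)^{-5/2}\,\Gamma\!\left(\tfrac{5}{2}\right)
- \frac{12}{\sqrt{\pi}}(1 - 2t)^{-3/2}\,\Gamma\!\left(\tfrac{3}{2}\right) + 3(1 - 2t)^{-1/2},
\end{aligned}
\]
\[
\begin{aligned}
2\int_{0}^{\infty} e^{tx^2}\phi(x)H_6(x)\,dx
&= 2\int_{0}^{\infty} e^{tx^2}\phi(x)(x^6 - 15x^4 + 45x^2 - 15)\,dx \\
&= \frac{8}{\sqrt{\pi}}(1 - 2t)^{-7/2}\,\Gamma\!\left(\tfrac{7}{2}\right)
- \frac{60}{\sqrt{\pi}}(1 - 2t)^{-5/2}\,\Gamma\!\left(\tfrac{5}{2}\right) \\
&\quad + \frac{90}{\sqrt{\pi}}(1 - 2t)^{-3/2}\,\Gamma\!\left(\tfrac{3}{2}\right) - 15(1 - 2t)^{-1/2}.
\end{aligned}
\]
The last three integrals can be substituted in the formula for $\psi(t)$. The moment generating function of $W$ is $[\psi(t)]^n$. On the other hand, the moment generating function of a chi-squared distribution with $n$ degrees of freedom is $(1 - 2t)^{-n/2}$ [see Example 6.9.8 concerning the moment generating function of a gamma distribution $G(\alpha, \beta)$, of which the chi-squared distribution is a special case with $\alpha = n/2$, $\beta = 2$].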
CHAPTER 11 
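11.1. (a)
\[
a_0 = \frac{1}{\pi}\int_{-\pi}^{\pi} |x|\,dx = \frac{2}{\pi}\int_{0}^{\pi} x\,dx = \pi,
\]
\[
\begin{aligned}
a_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} |x|\cos nx\,dx
= \frac{2}{\pi}\int_{0}^{\pi} x\cos nx\,dx \\
&= -\frac{2}{n\pi}\int_{0}^{\pi} \sin nx\,dx
= \frac{2}{\pi n^2}\left[(-1)^n - 1\right],
\end{aligned}
\]
\[
b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} |x|\sin nx\,dx = 0,
\]
\[
|x| = \frac{\pi}{2} - \frac{4}{\pi}\left(\cos x + \frac{\cos 3x}{3^2} + \frac{\cos 5x}{5^2} + \cdots\right).
\]
(b)
\[
a_0 = \frac{1}{\pi}\int_{-\pi}^{\pi} |\sin x|\,dx = \frac{4}{\pi},
\]
\[
\begin{aligned}
a_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} |\sin x|\cos nx\,dx
= \frac{2}{\pi}\int_{0}^{\pi} \sin x\cos nx\,dx \\
&= \frac{1}{\pi}\int_{0}^{\pi} [\sin(n+1)x - \sin(n-1)x]\,dx
= -\frac{2\left[(-1)^n + 1\right]}{\pi(n^2 - 1)}, \quad n \ne 1,
\end{aligned}
\]
\[
a_1 = 0,
\]
\[
b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} |\sin x|\sin nx\,dx = 0,
\]
\[
|\sin x| = \frac{2}{\pi} - \frac{4}{\pi}\left(\frac{\cos 2x}{3} + \frac{\cos 4x}{15} + \frac{\cos 6x}{35} + \cdots\right).
\]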
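(c)
\[
a_0 = \frac{1}{\pi}\int_{-\pi}^{\pi} (x + x^2)\,dx = \frac{2\pi^2}{3},
\]
\[
\begin{aligned}
a_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} (x + x^2)\cos nx\,dx
= \frac{2}{\pi}\int_{0}^{\pi} x^2\cos nx\,dx \\
&= \frac{4}{n^2}\cos n\pi = (-1)^n\frac{4}{n^2},
\end{aligned}
\]
\[
\begin{aligned}
b_n &= \frac{1}{\pi}\int_{-\pi}^{\pi} (x + x^2)\sin nx\,dx
= \frac{2}{\pi}\int_{0}^{\pi} x\sin nx\,dx \\
&= (-1)^{n-1}\frac{2}{n},
\end{aligned}
\]
\[
x + x^2 = \frac{\pi^2}{3} + 4\left(-\cos x + \frac{1}{2}\sin x\right)
- 4\left(-\frac{1}{2^2}\cos 2x + \frac{1}{4}\sin 2x\right) + \cdots
\]
for $-\pi < x < \pi$. When $x = \pm\pi$, the sum of the series is $\frac{1}{2}[(-\pi + \pi^2) + (\pi + \pi^2)] = \pi^2$, that is,
\[
\pi^2 = \frac{\pi^2}{3} + 4\left(1 + \frac{1}{2^2} + \frac{1}{3^2} + \cdots\right).
\]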


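11.2. From Example 11.2.2, we have
\[
x^2 = \frac{\pi^2}{3} + 4\sum_{n=1}^{\infty} \frac{(-1)^n}{n^2}\cos nx.
\]
Putting $x = \pm\pi$, we get
\[
\frac{\pi^2}{6} = \sum_{n=1}^{\infty} \frac{1}{n^2}.
\]
Using $x = 0$, we obtain
\[
0 = \frac{\pi^2}{3} + 4\sum_{n=1}^{\infty} \frac{(-1)^n}{n^2},
\]
or equivalently,
\[
\frac{\pi^2}{12} = \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{n^2}.
\]
Adding the two series corresponding to $\pi^2/6$ and $\pi^2/12$, we get
\[
\frac{3\pi^2}{12} = 2\sum_{n=1}^{\infty} \frac{1}{(2n-1)^2},
\]
that is,
\[
\frac{\pi^2}{8} = \sum_{n=1}^{\infty} \frac{1}{(2n-1)^2}
\]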


11.3. By Theorem 11.3.1, we have
\[
f'(x) = \sum_{n=1}^{\infty} (nb_n\cos nx - na_n\sin nx).
\]
Furthermore, inequality (11.28) indicates that $\sum_{n=1}^{\infty}(n^2b_n^2 + n^2a_n^2)$ is a convergent series. It follows that $nb_n \to 0$ and $na_n \to 0$ as $n \to \infty$ (see Result 5.2.1 in Section 5.2).


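11.4. For $m > n$, we have
\[
\begin{aligned}
|s_m(x) - s_n(x)| &= \left|\sum_{k=n+1}^{m} (a_k\cos kx + b_k\sin kx)\right| \\
&\le \sum_{k=n+1}^{m} (a_k^2 + b_k^2)^{1/2} \\
&= \sum_{k=n+1}^{m} \frac{1}{k}(\alpha_k^2 + \beta_k^2)^{1/2} \quad [\text{see (11.27)}] \\
&\le \left(\sum_{k=n+1}^{m} \frac{1}{k^2}\right)^{1/2}\left[\sum_{k=n+1}^{m} (\alpha_k^2 + \beta_k^2)\right]^{1/2}.
\end{aligned}
\]
But, by inequality (11.28),
\[
\sum_{k=n+1}^{m} (\alpha_k^2 + \beta_k^2) \le \frac{1}{\pi}\int_{-\pi}^{\pi} [f'(x)]^2\,dx
\]
and
\[
\sum_{k=n+1}^{m} \frac{1}{k^2} \le \sum_{k=n+1}^{\infty} \frac{1}{k^2} \le \int_{n}^{\infty} \frac{dx}{x^2} = \frac{1}{n}.
\]
Thus,
\[
|s_m(x) - s_n(x)| \le \frac{c}{\sqrt{n}},
\]
where
\[
c = \left\{\frac{1}{\pi}\int_{-\pi}^{\pi} [f'(x)]^2\,dx\right\}^{1/2}.
\]
By Theorem 11.3.1(b), $s_m(x) \to f(x)$ on $[-\pi, \pi]$ as $m \to \infty$. Thus, by letting $m \to \infty$, we get
\[
|f(x) - s_n(x)| \le \frac{c}{\sqrt{n}}.
\]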


11.5. (a) From the proof of Theorem 11.3.2, we have 


2D, A, aT 


COs = = 
D T oO 


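This implies that the series $\sum_{n=1}^{\infty} (-1)^n b_n/n$ is convergent.
(b) From the proof of Theorem 11.3.2, we have
\[
\int_{-\pi}^{x} f(t)\,dt = \frac{a_0(\pi + x)}{2}
+ \sum_{n=1}^{\infty}\left[\frac{a_n}{n}\sin nx - \frac{b_n}{n}(\cos nx - \cos n\pi)\right].
\]
Putting $x = 0$, we get
\[
\int_{-\pi}^{0} f(t)\,dt = \frac{a_0\pi}{2} - \sum_{n=1}^{\infty} \frac{b_n}{n}\left[1 - (-1)^n\right].
\]
This implies convergence of the series $\sum_{n=1}^{\infty}(b_n/n)[1 - (-1)^n]$. Since $\sum_{n=1}^{\infty}(-1)^n b_n/n$ is convergent by part (a), then so is $\sum_{n=1}^{\infty} b_n/n$.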
11.6. Using the hint and part (b) of Exercise 11.5, the series $\sum_{n=2}^{\infty} b_n/n$ would be convergent, where $b_n = 1/\log n$. However, the series $\sum_{n=2}^{\infty} 1/(n\log n)$ is divergent by Maclaurin's integral test (see Theorem 6.5.4).


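11.7. (a) From Example 11.2.1, we have
\[
\frac{x}{2} = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}\sin nx
\]
for $-\pi < x < \pi$. Hence,
\[
\begin{aligned}
\frac{x^2}{4} &= \int_{0}^{x} \frac{t}{2}\,dt \\
&= -\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2}\cos nx + \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2}.
\end{aligned}
\]
Note that
\[
\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{x^2}{4}\,dx = \frac{\pi^2}{12}.
\]
Hence,
\[
\frac{x^2}{4} = \frac{\pi^2}{12} - \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2}\cos nx, \quad -\pi < x < \pi.
\]
(b) This follows directly from part (a).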
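11.8. Using the result in Exercise 11.7, we obtain
\[
\begin{aligned}
\int_{0}^{x}\left(\frac{\pi^2}{12} - \frac{t^2}{4}\right)dt
&= \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^2}\int_{0}^{x} \cos nt\,dt \\
&= \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^3}\sin nx, \quad -\pi < x < \pi.
\end{aligned}
\]
Hence,
\[
\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n^3}\sin nx = \frac{\pi^2}{12}x - \frac{x^3}{12}, \quad -\pi < x < \pi.
\]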
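11.9.
\[
F(w) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-x^2}e^{-iwx}\,dx,
\]
\[
F'(w) = \frac{i}{4\pi}\int_{-\infty}^{\infty} (-2xe^{-x^2})\,e^{-iwx}\,dx
\]
(exchanging the order of integration and differentiation is permissible here by an extension of Theorem 7.10.1). Integrating by parts, we obtain
\[
\begin{aligned}
F'(w) &= \frac{i}{4\pi}\,e^{-x^2}e^{-iwx}\,\Big|_{-\infty}^{\infty}
- \frac{i}{4\pi}\int_{-\infty}^{\infty} e^{-x^2}(-iwe^{-iwx})\,dx \\
&= -\frac{w}{4\pi}\int_{-\infty}^{\infty} e^{-x^2}e^{-iwx}\,dx \\
&= -\frac{w}{2}F(w).
\end{aligned}
\]
The general solution of this differential equation is
\[
F(w) = Ae^{-w^2/4},
\]
where $A$ is a constant. Putting $w = 0$, we obtain
\[
A = F(0) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2\pi} = \frac{1}{2\sqrt{\pi}}.
\]
Hence, $F(w) = (1/2\sqrt{\pi})\,e^{-w^2/4}$.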
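11.10. Let $H(w)$ be the Fourier transform of $(f * g)(x)$. Then
\[
\begin{aligned}
H(w) &= \frac{1}{2\pi}\int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} f(x - y)g(y)\,dy\right]e^{-iwx}\,dx \\
&= \frac{1}{2\pi}\int_{-\infty}^{\infty}\left[\int_{-\infty}^{\infty} f(x - y)e^{-iw(x - y)}\,dx\right]g(y)e^{-iwy}\,dy \\
&= \int_{-\infty}^{\infty}\left[\frac{1}{2\pi}\int_{-\infty}^{\infty} f(z)e^{-iwz}\,dz\right]g(y)e^{-iwy}\,dy \\
&= F(w)\int_{-\infty}^{\infty} g(y)e^{-iwy}\,dy \\
&= 2\pi F(w)G(w).
\end{aligned}
\]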
11.11. Apply the results in Exercises 11.9 and 11.10 to find the Fourier
transform of f(x); then apply the inverse Fourier transform theorem 
(Theorem 11.6.3) to find f(x). 


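11.12. (a)
\[
\begin{aligned}
s_n(x_n) &= \sum_{k=1}^{n} \frac{2(-1)^{k+1}}{k}\sin\left[k\left(\pi - \frac{\pi}{n}\right)\right] \\
&= \sum_{k=1}^{n} \frac{2(-1)^{k+1}}{k}(-1)^{k+1}\sin\left(\frac{k\pi}{n}\right) \\
&= \sum_{k=1}^{n} \frac{2\sin(k\pi/n)}{k}.
\end{aligned}
\]
(b)
\[
s_n(x_n) = \sum_{k=1}^{n} \frac{2\sin(k\pi/n)}{k\pi/n}\cdot\frac{\pi}{n}.
\]
By dividing the interval $[0, \pi]$ into $n$ subintervals $[(k-1)\pi/n, k\pi/n]$, $1 \le k \le n$, each of length $\pi/n$, it is easy to see that $s_n(x_n)$ is a Riemann sum $S(P, g)$ for the function $g(x) = (2\sin x)/x$. Here, $g(x)$ is evaluated at the right-hand end point of each subinterval of the partition $P$ (see Section 6.1). Thus,
\[
\lim_{n \to \infty} s_n(x_n) = \int_{0}^{\pi} \frac{2\sin x}{x}\,dx.
\]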
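The limit can be illustrated numerically. The sketch below assumes the underlying series is that of $f(x) = x$ on $(-\pi, \pi)$, namely $2\sum_k (-1)^{k+1}\sin(kx)/k$, and evaluates its partial sum at $x_n = \pi - \pi/n$. The limiting value $2\,\mathrm{Si}(\pi)$ is computed from the Maclaurin series of the sine integral. Function names are hypothetical.

```python
import math

def partial_sum_at_xn(n):
    """s_n(x_n) for the (assumed) Fourier series of f(x) = x, at x_n = pi - pi/n."""
    x_n = math.pi - math.pi / n
    return sum(2 * (-1) ** (k + 1) / k * math.sin(k * x_n) for k in range(1, n + 1))

def sine_integral(x, terms=40):
    """Si(x) = integral of sin(t)/t over [0, x], from its Maclaurin series."""
    return sum((-1) ** k * x ** (2 * k + 1) / ((2 * k + 1) * math.factorial(2 * k + 1))
               for k in range(terms))

limit = 2 * sine_integral(math.pi)
for n in (10, 100, 1000):
    print(n, partial_sum_at_xn(n), limit)
```

The partial sums stay well above $\pi$, the value of $f$ at the right end point, which is the overshoot associated with the Gibbs phenomenon.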


11.13. (a) $\hat{y} = 8.512 + 3.198\cos\phi - 0.922\sin\phi + 1.903\cos 2\phi + 3.017\sin 2\phi$.
(b) Estimates of the locations of minimum and maximum resistance are $\phi = 0.7944\pi$ and $\phi = 0.1153\pi$, respectively. [Note: For more details, see Kupper (1972), Section 5.]


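11.15. (a)
\[
s_n^* = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sigma\sqrt{n}}\sum_{j=1}^{n} U_j,
\]
where $U_j = Y_j - \mu$, $j = 1, 2, \ldots, n$. Note that $E(U_j) = 0$ and $\operatorname{Var}(U_j) = \sigma^2$. The characteristic function of $s_n^*$ is
\[
\begin{aligned}
\phi_n(t) &= E(e^{its_n^*}) \\
&= E\left(e^{it\sum_{j=1}^{n}U_j/\sigma\sqrt{n}}\right) \\
&= \prod_{j=1}^{n} E\left(e^{itU_j/\sigma\sqrt{n}}\right).
\end{aligned}
\]
Now,
\[
E(e^{itU_j}) = E\left[1 + itU_j - \frac{t^2}{2}U_j^2 + \cdots\right] = 1 - \frac{t^2\sigma^2}{2} + o(t^2),
\]
\[
E\left(e^{itU_j/\sigma\sqrt{n}}\right) = 1 - \frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right).
\]
Hence,
\[
\phi_n(t) = \left[1 - \frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right)\right]^n.
\]
Let
\[
\omega_n = -\frac{t^2}{2n} + o\!\left(\frac{t^2}{n}\right).
\]
Then
\[
\phi_n(t) = \left[(1 + \omega_n)^{1/\omega_n}\right]^{n\omega_n}
= \left[(1 + \omega_n)^{1/\omega_n}\right]^{-t^2/2}\left[(1 + \omega_n)^{1/\omega_n}\right]^{o(1)}.
\]
As $n \to \infty$, $\omega_n \to 0$ and
\[
\left[(1 + \omega_n)^{1/\omega_n}\right]^{-t^2/2} \to e^{-t^2/2}, \qquad
\left[(1 + \omega_n)^{1/\omega_n}\right]^{o(1)} \to 1.
\]
Thus,
\[
\phi_n(t) \to e^{-t^2/2} \quad \text{as } n \to \infty.
\]
(b) $e^{-t^2/2}$ is the characteristic function of the standard normal distribution $Z$. Hence, by Theorem 11.7.3, as $n \to \infty$, $s_n^* \to Z$ in distribution.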


CHAPTER 12 


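12.1. (a) With $h = 1$,
\[
\begin{aligned}
\int_{1}^{n} \log x\,dx &\approx \frac{h}{2}\left[\log 1 + \log n + 2\sum_{i=2}^{n-1}\log i\right] \\
&= \tfrac{1}{2}\log n + \sum_{i=2}^{n-1}\log i \\
&= \tfrac{1}{2}\log n + \log 2 + \log 3 + \cdots + \log(n-1) \\
&= \tfrac{1}{2}\log n + \log(n!) - \log n \\
&= \log(n!) - \tfrac{1}{2}\log n.
\end{aligned}
\]
(b) $n\log n - n + 1 \approx \log(n!) - \frac{1}{2}\log n$. Hence,
\[
\begin{aligned}
\log(n!) &\approx \left(n + \tfrac{1}{2}\right)\log n - n + 1, \\
n! &\approx \exp\left[\left(n + \tfrac{1}{2}\right)\log n - n + 1\right] \\
&= e\,e^{-n}n^{n+1/2} = e^{-n+1}n^{n+1/2}.
\end{aligned}
\]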

12.2. Applying formula (12.10), we obtain the following approximate values for the integral $\int_{0}^{1} dx/(1 + x)$:

n     Approximation
2     0.69325395
4     0.69315450
8     0.69314759
16    0.69314708

The exact value of the integral is $\log 2 = 0.69314718$. Using formula (12.12), the error of approximation is less than or equal to
\[
\frac{M_4(b - a)^5}{2880n^4} = \frac{M_4}{2880(8)^4},
\]
where $M_4$ is an upper bound on $|f^{(4)}(x)|$ for $0 \le x \le 1$. Here, $f(x) = 1/(1 + x)$, and
\[
f^{(4)}(x) = \frac{24}{(1 + x)^5}.
\]
Hence, $M_4 = 24$, and
\[
\frac{M_4}{2880(8)^4} \approx 0.000002.
\]
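The table can be regenerated with a few lines of code. The sketch below assumes that formula (12.10) is the composite Simpson rule on $2n$ subintervals of width $h = (b - a)/2n$, which is consistent with the tabulated value for $n = 2$; the function name `simpson` is hypothetical.

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson's rule with 2n subintervals of width h = (b - a) / (2n)."""
    h = (b - a) / (2 * n)
    total = f(a) + f(b)
    for i in range(1, 2 * n):
        # odd interior points get weight 4, even interior points weight 2
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

f = lambda x: 1.0 / (1.0 + x)
for n in (2, 4, 8, 16):
    print(n, simpson(f, 0.0, 1.0, n))

# Error bound from formula (12.12) with M_4 = 24 and n = 8.
bound = 24 / (2880 * 8 ** 4)
print(bound, abs(simpson(f, 0.0, 1.0, 8) - math.log(2)))
```

In double precision the results agree with the tabulated values to about six decimal places; the small differences in the last digits are presumably due to rounding in the original computation.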
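12.3. Let
\[
z = \frac{2\theta - \pi/2}{\pi/2}, \quad 0 \le \theta \le \frac{\pi}{2},
\]
\[
\theta = \frac{\pi}{4}(1 + z), \quad -1 \le z \le 1.
\]
Then
\[
\int_{0}^{\pi/2} \sin\theta\,d\theta = \frac{\pi}{4}\int_{-1}^{1} \sin\left[\frac{\pi}{4}(1 + z)\right]dz
\approx \frac{\pi}{4}\sum_{i=0}^{2} \omega_i\sin\left[\frac{\pi}{4}(1 + z_i)\right].
\]
Using the information in Example 12.4.1, we have $z_0 = -\sqrt{3/5}$, $z_1 = 0$, $z_2 = \sqrt{3/5}$; $\omega_0 = \frac{5}{9}$, $\omega_1 = \frac{8}{9}$, $\omega_2 = \frac{5}{9}$. Hence,
\[
\int_{0}^{\pi/2} \sin\theta\,d\theta \approx 1.0000081.
\]
The error of approximation associated with $\int_{-1}^{1} \sin[(\pi/4)(1 + z)]\,dz$ is obtained by applying formula (12.16) with $a = -1$, $b = 1$, $f(z) = \sin[(\pi/4)(1 + z)]$, $n = 2$. Thus, $f^{(6)}(\xi) = -(\pi/4)^6\sin[(\pi/4)(1 + \xi)]$, $-1 < \xi < 1$, and the error is therefore less than
\[
\frac{2^7(3!)^4}{7(6!)^3}\left(\frac{\pi}{4}\right)^6.
\]
Consequently, the error associated with $\int_{0}^{\pi/2} \sin\theta\,d\theta$ is less than
\[
\frac{\pi}{4}\cdot\frac{2^7(3!)^4}{7(6!)^3}\left(\frac{\pi}{4}\right)^6
= \frac{2^7(3!)^4}{7(6!)^3}\left(\frac{\pi}{4}\right)^7
= 0.0000117.
\]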
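A minimal sketch of this three-point Gauss-Legendre computation, using the nodes and weights quoted above; the variable names are illustrative only.

```python
import math

# Three-point Gauss-Legendre nodes and weights on [-1, 1].
nodes = [-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5)]
weights = [5 / 9, 8 / 9, 5 / 9]

# Map [0, pi/2] to [-1, 1] through theta = (pi/4)(1 + z).
approx = (math.pi / 4) * sum(
    w * math.sin(math.pi / 4 * (1 + z)) for w, z in zip(weights, nodes)
)

# Error bound quoted in the solution for the integral over [0, pi/2].
error_bound = 2 ** 7 * math.factorial(3) ** 4 / (7 * math.factorial(6) ** 3) * (math.pi / 4) ** 7

print(approx, abs(approx - 1.0), error_bound)
```

The exact value of the integral is 1, so the actual error can be compared directly with the bound.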


12.4. (a) The inequality is obvious from the hint.
(b) Let $m = 3$. Then,
\[
\frac{1}{m}e^{-m^2} = \frac{1}{3}e^{-9} = 0.0000411.
\]
(c) Apply Simpson's method on $[0, 3]$ with $h \le 0.2$. Here,
\[
f^{(4)}(x) = (12 - 48x^2 + 16x^4)e^{-x^2},
\]
with a maximum of 12 at $x = 0$ over $[0, 3]$. Hence, $M_4 = 12$. Using the fact that $h = (b - a)/2n$, we get $n = (b - a)/2h \ge 3/0.4 = 7.5$. Formula (12.12), with $n = 8$, gives
\[
\frac{nM_4h^5}{90} < \frac{8(12)(0.2)^5}{90} = 0.00034.
\]
Hence, the total error of approximation for $\int_{0}^{\infty} e^{-x^2}\,dx$ is less than $0.0000411 + 0.00034 = 0.00038$. This makes the approximation correct to three decimal places.


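12.5.
\[
\int_{0}^{\infty} \frac{x}{e^{x} + e^{-x} - 1}\,dx
= \int_{0}^{\infty} \frac{xe^{-x}}{1 + e^{-2x} - e^{-x}}\,dx
\approx \sum_{i=0}^{n} \omega_i\,\frac{x_i}{1 + e^{-2x_i} - e^{-x_i}},
\]
where the $x_i$'s are the zeros of the Laguerre polynomial $L_{n+1}(x)$. Using the entries in Table 12.1 for $n = 1$, we get
\[
\int_{0}^{\infty} \frac{x}{e^{x} + e^{-x} - 1}\,dx
\approx 0.85355\,\frac{0.58579}{1 + e^{-2(0.58579)} - e^{-0.58579}}
+ 0.14645\,\frac{3.41421}{1 + e^{-2(3.41421)} - e^{-3.41421}} = 1.18.
\]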
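The two-point Gauss-Laguerre value can be reproduced as follows. The sketch uses the closed-form nodes $2 \mp \sqrt{2}$ and weights $(2 \pm \sqrt{2})/4$, which match the rounded table entries quoted above, and it assumes the integrand as reconstructed here.

```python
import math

# Two-point Gauss-Laguerre rule (weight function exp(-x) on [0, infinity)).
nodes = [2 - math.sqrt(2), 2 + math.sqrt(2)]               # about 0.58579 and 3.41421
weights = [(2 + math.sqrt(2)) / 4, (2 - math.sqrt(2)) / 4]  # about 0.85355 and 0.14645

def g(x):
    """Integrand with the factor exp(-x) removed."""
    return x / (1 + math.exp(-2 * x) - math.exp(-x))

approx = sum(w * g(x) for w, x in zip(weights, nodes))
print(approx)
```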
12.6. (a) Let $u = (2t - x)/x$. Then
\[
\begin{aligned}
I(x) &= \frac{x}{2}\int_{-1}^{1} \frac{du}{1 + \frac{1}{4}x^2(u + 1)^2} \\
&= 2x\int_{-1}^{1} \frac{du}{4 + x^2(u + 1)^2}.
\end{aligned}
\]
(b) Applying a Gauss-Legendre approximation of $I(x)$ with $n = 4$, we obtain
\[
I(x) \approx 2x\sum_{i=0}^{4} \frac{\omega_i}{4 + x^2(u_i + 1)^2},
\]
where $u_0, u_1, u_2, u_3$, and $u_4$ are the five zeros of the Legendre polynomial $P_5(u)$ of degree 5. Values of these zeros and the corresponding $\omega_i$'s are given below:

u_0 =  0.90617985,    ω_0 = 0.23692689,
u_1 =  0.53846931,    ω_1 = 0.47862867,
u_2 =  0,             ω_2 = 0.56888889,
u_3 = -0.53846931,    ω_3 = 0.47862867,
u_4 = -0.90617985,    ω_4 = 0.23692689

(see Table 7.5 in Shoup, 1984, page 202).


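12.7. $\int_{0}^{1} (\cos x)^{\lambda}\,dx = \int_{0}^{1} e^{\lambda\log\cos x}\,dx$. Using the method of Laplace, the function $h(x) = \log\cos x$ has a single maximum in $[0, 1]$ at $x = 0$. Formula (12.28) gives
\[
\int_{0}^{1} e^{\lambda h(x)}\,dx \sim e^{\lambda h(0)}\left[\frac{-\pi}{2\lambda h''(0)}\right]^{1/2}
\]
as $\lambda \to \infty$, where
\[
h''(0) = -\frac{1}{\cos^2 x}\,\Big|_{x=0} = -1.
\]
Hence,
\[
\int_{0}^{1} e^{\lambda\log\cos x}\,dx \sim \sqrt{\frac{\pi}{2\lambda}}.
\]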
12.9.


a- A n o A T 
fia 2) de, de, = = 


= 0.52359878.


Applying the 16-point Gauss—Legendre rule to both formulas gives 
the approximate value 0.52362038. 
[ Note: For more details, see Stroud (1971, Section 1.6).] 


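12.10.
\[
\int_{0}^{\infty} (1 + x^2)^{-n}\,dx = \int_{0}^{\infty} e^{-n\log(1 + x^2)}\,dx.
\]
Apply the method of Laplace with $g(x) = 1$ and
\[
h(x) = -\log(1 + x^2).
\]
This function has a single maximum in $[0, \infty)$ at $x = 0$. Using formula (12.28), we get
\[
\int_{0}^{\infty} e^{nh(x)}\,dx \sim e^{nh(0)}\left[\frac{-\pi}{2nh''(0)}\right]^{1/2},
\]
where $h''(0) = -2$. Hence,
\[
\int_{0}^{\infty} (1 + x^2)^{-n}\,dx \sim \left(\frac{\pi}{4n}\right)^{1/2} = \frac{\sqrt{\pi}}{2\sqrt{n}}.
\]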
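The quality of this Laplace approximation can be checked against the closed form $\int_{0}^{\infty}(1 + x^2)^{-n}\,dx = \frac{\sqrt{\pi}}{2}\,\Gamma(n - \frac{1}{2})/\Gamma(n)$, a standard beta-integral identity that is not part of the solution itself. A minimal sketch with hypothetical function names:

```python
import math

def exact(n):
    """Closed form (sqrt(pi)/2) * Gamma(n - 1/2) / Gamma(n), via log-gamma for stability."""
    return 0.5 * math.sqrt(math.pi) * math.exp(math.lgamma(n - 0.5) - math.lgamma(n))

def laplace(n):
    """Leading-order Laplace approximation sqrt(pi) / (2 sqrt(n))."""
    return math.sqrt(math.pi) / (2 * math.sqrt(n))

for n in (1, 10, 100, 1000):
    print(n, exact(n), laplace(n), exact(n) / laplace(n))
```

The ratio of the two values approaches 1 as $n$ grows, as the asymptotic statement requires.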


12.11. 
Number of Quadrature Points    Approximate Value of Variance
2                              0.1175
4                              0.2380
8                              0.2341
16                             0.1612
32                             0.1379
64                             0.1372


[Source: Example 16.6.1 in Lange (1999).] 


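12.12. If $x \ge 0$, then
\[
\Phi(x) = \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\int_{0}^{x} e^{-t^2/2}\,dt.
\]
Using Maclaurin's series (see Section 4.3), we obtain
\[
\begin{aligned}
\Phi(x) &= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\int_{0}^{x}\left[1 + \left(-\frac{t^2}{2}\right)
+ \frac{1}{2!}\left(-\frac{t^2}{2}\right)^2 + \frac{1}{3!}\left(-\frac{t^2}{2}\right)^3
+ \cdots + \frac{1}{n!}\left(-\frac{t^2}{2}\right)^n + \cdots\right]dt \\
&= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\left[t - \frac{t^3}{3 \times 2 \times 1!}
+ \frac{t^5}{5 \times 2^2 \times 2!} - \frac{t^7}{7 \times 2^3 \times 3!}
+ \cdots + (-1)^n\frac{t^{2n+1}}{(2n+1) \times 2^n \times n!} + \cdots\right]_{0}^{x} \\
&= \frac{1}{2} + \frac{x}{\sqrt{2\pi}}\left[1 - \frac{x^2}{3 \times 2 \times 1!}
+ \frac{x^4}{5 \times 2^2 \times 2!} - \frac{x^6}{7 \times 2^3 \times 3!}
+ \cdots + (-1)^n\frac{x^{2n}}{(2n+1) \times 2^n \times n!} + \cdots\right].
\end{aligned}
\]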


12.13. From Exercise 12.12 we need to find values of $a, b, c, d$ such that
\[
1 + ax^2 + bx^4 \approx (1 + cx^2 + dx^4)\left(1 - \frac{x^2}{6} + \frac{x^4}{40} - \frac{x^6}{336} + \frac{x^8}{3456}\right).
\]
Equating the coefficients of $x^2, x^4, x^6, x^8$ on both sides yields four equations in $a, b, c, d$. Solving these equations, we get
\[
a = \frac{17}{468}, \quad b = \frac{739}{196560}, \quad c = \frac{95}{468}, \quad d = \frac{55}{4368}.
\]
[Note: For more details, see Morland (1998).]
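The four equations can be solved exactly with rational arithmetic. In the sketch below, the coefficients of $x^6$ and $x^8$ on the right-hand side are set to zero to obtain $c$ and $d$ (by Cramer's rule), after which the coefficients of $x^2$ and $x^4$ give $a$ and $b$. The variable names are illustrative.

```python
from fractions import Fraction as F

# Series 1 + s1*x^2 + s2*x^4 + s3*x^6 + s4*x^8 from Exercise 12.12.
s1, s2, s3, s4 = F(-1, 6), F(1, 40), F(-1, 336), F(1, 3456)

# Coefficients of x^6 and x^8 in (1 + c x^2 + d x^4) * series must vanish:
#   s2*c + s1*d = -s3
#   s3*c + s2*d = -s4
det = s2 * s2 - s1 * s3
c = (-s3 * s2 + s1 * s4) / det
d = (-s4 * s2 + s3 * s3) / det

# Coefficients of x^2 and x^4 then determine a and b.
a = s1 + c
b = s2 + c * s1 + d

print(a, b, c, d)
```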
12.14.
\[
y = G(x) = \int_{0}^{x} \tfrac{1}{4}(1 + t)\,dt = \tfrac{1}{8}(x^2 + 2x), \quad 0 \le x \le 2.
\]
The only solution of $y = G(x)$ in $[0, 2]$ is $x = -1 + (1 + 8y)^{1/2}$, $0 \le y \le 1$. A random sample of size 150 from the $g(x)$ distribution is given by
\[
x_i = -1 + (1 + 8y_i)^{1/2}, \quad i = 1, 2, \ldots, 150,
\]
where the $y_i$'s form a random sample from the $U(0, 1)$ distribution. Applying formula (12.43), we get
\[
\hat{I}^*_{150} = \frac{4}{150}\sum_{i=1}^{150} \frac{e^{x_i^2}}{1 + x_i} = 16.6572.
\]
Using now formula (12.36), we obtain
\[
\hat{I}_{150} = \frac{2}{150}\sum_{i=1}^{150} e^{u_i^2} = 17.8878,
\]
where the $u_i$'s form a random sample from the $U(0, 2)$ distribution.
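The two estimators can be sketched as follows, assuming the target integral is $\int_{0}^{2} e^{x^2}\,dx \approx 16.4526$ and the sampling density is $g(x) = (1 + x)/4$ on $[0, 2]$, as reconstructed above. The function names and the seed are hypothetical, and the random draws will not reproduce the book's sample, so the printed estimates differ from 16.6572 and 17.8878.

```python
import math
import random

def importance_estimate(n, rng):
    """Importance-sampling estimate with density g(x) = (1 + x) / 4 on [0, 2]."""
    total = 0.0
    for _ in range(n):
        y = rng.random()
        x = -1.0 + math.sqrt(1.0 + 8.0 * y)    # inverse of G(x) = (x^2 + 2x) / 8
        total += math.exp(x * x) / ((1.0 + x) / 4.0)
    return total / n

def crude_estimate(n, rng):
    """Crude Monte Carlo estimate using a U(0, 2) sample."""
    return sum(2.0 * math.exp(rng.uniform(0.0, 2.0) ** 2) for _ in range(n)) / n

rng = random.Random(2024)   # arbitrary seed, for repeatability only
print(importance_estimate(150, rng), crude_estimate(150, rng))
```

With only 150 draws both estimates are noisy; increasing the sample size shows both settling near the true value.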
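12.15. As $n \to \infty$,
\[
\left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} = \left(1 + \frac{x^2}{n}\right)^{-n/2}\left(1 + \frac{x^2}{n}\right)^{-1/2}
\sim e^{-x^2/2}.
\]
Furthermore, by applying Stirling's formula (formula 12.31), we obtain
\[
\Gamma\!\left(\frac{n+1}{2}\right) \sim e^{-(n-1)/2}\left(\frac{n-1}{2}\right)^{(n-1)/2}\left[2\pi\,\frac{n-1}{2}\right]^{1/2},
\]
\[
\Gamma\!\left(\frac{n}{2}\right) \sim e^{-(n-2)/2}\left(\frac{n-2}{2}\right)^{(n-2)/2}\left[2\pi\,\frac{n-2}{2}\right]^{1/2}.
\]
Hence,
\[
\begin{aligned}
\frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\,\Gamma\!\left(\frac{n}{2}\right)}
&\sim \frac{1}{\sqrt{n\pi}}\,\frac{1}{e^{1/2}}\,
\frac{\left(\frac{n-1}{2}\right)^{(n-1)/2}}{\left(\frac{n-2}{2}\right)^{(n-2)/2}}
\left(\frac{n-1}{n-2}\right)^{1/2} \\
&= \frac{1}{\sqrt{n\pi}}\,\frac{1}{e^{1/2}}\left(\frac{n-1}{n-2}\right)^{(n-1)/2}
\left(\frac{n-2}{2}\right)^{1/2}\left(\frac{n-1}{n-2}\right)^{1/2} \\
&\sim \frac{1}{\sqrt{n\pi}}\,\frac{1}{e^{1/2}}\,e^{1/2}\sqrt{\frac{n}{2}} \\
&= \frac{1}{\sqrt{2\pi}}.
\end{aligned}
\]
Hence, for large $n$,
\[
F(x) \approx \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x} e^{-t^2/2}\,dt.
\]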